47 datasets found

GitHub Programming Languages Data
kaggle.com
zip
Updated Jan 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
Explore at:
zip(41198 bytes)Available download formats
Dataset updated
Jan 2, 2022
Authors
Isaac Wen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

A common question for those new and familiar to computer science and software engineering is what is the most best and/or most popular programming language. It is very difficult to give a definitive answer, as there are a seemingly indefinite number of metrics that can define the 'best' or 'most popular' programming language.

One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages that are used for repositories, PRs, and issues on GitHub and be a good indicator for the popularity of a language.

Content

This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

Source

This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

Limitations

Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.
d
Data from: A Generic Local Algorithm for Mining Data Streams in Large...
catalog.data.gov
datasets.ai
+2more
Updated Apr 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the emph{model} of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models e.g. decision trees, k-means clustering in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient emph{local} algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Explore at:
zip(4293465577 bytes)Available download formats
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields: - name - title of the article. - identifier - ID of the article. - image - main image representing the article's subject. - description - one-sentence description of the article for quick reference. - abstract - lead section, summarizing what the article is about. - infoboxes - parsed information from the side panel (infobox) on the Wikipedia article. - sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes - Compressed: 2GB - Uncompressed: 11GB

Infoboxes + sections + short description - Size of compressed file: 4.12 GB - Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown: - total # of articles analyzed: 6,940,949 - # people found with QID: 1,778,226 - # people found with Category: 158,996 - people found with Biography Project: 76,150 - Total # of people articles found: 2,013,372 - Total # people articles with infoboxes: 1,559,985 End stats - Total number of people articles in this dataset: 1,559,985 - that have a short description: 1,416,701 - that have an infobox: 1,559,985 - that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Data Dictionary.xlsx
kaggle.com
zip
Updated Jan 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Himanshu Bhardwaj (2021). Data Dictionary.xlsx [Dataset]. https://www.kaggle.com/himanshu911/data-dictionaryxlsx
Explore at:
zip(38618 bytes)Available download formats
Dataset updated
Jan 22, 2021
Authors
Himanshu Bhardwaj
Description
Dataset

This dataset was created by Himanshu Bhardwaj

Contents
Z
Data from: Discrete Dataset - A dataset for computing with operators defined...
data.niaid.nih.gov
Updated Nov 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Munar, Marc; Couceiro, Miguel; Massanet, Sebastia; Ruiz-Aguilera, Daniel (2023). Discrete Dataset - A dataset for computing with operators defined on a finite chain [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10184481
Explore at:
Dataset updated
Nov 22, 2023
Dataset provided by
University of the Balearic Islands
Universitat de les Illes Balears
Laboratoire Lorrain de Recherche en Informatique et ses Applications
Authors
Munar, Marc; Couceiro, Miguel; Massanet, Sebastia; Ruiz-Aguilera, Daniel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Discrete Dataset is a collection of the main operators defined on the finite chain L_n={0,1,...,n} up to n=11. These operators have been computationally generated, with the aim of being used to study properties of the operators. The easiest way to use these operators is with the Python package "DiscreteFuzzyOperators", published at https://zenodo.org/doi/10.5281/zenodo.5031268. It currently contains the following operators:

For n=1,2,3,4, it contains:

Discrete aggregation functions. Smooth discrete aggregation functions. Smooth and commutative discrete aggregation functions. Discrete conjunctions. Discrete disjunctions. Discrete t-norms. Discrete t-conorms. Discrete uninorms. Discrete negations. For n=5,6, it contains:

Smooth and commutative discrete aggregation functions. Discrete conjunctions. Discrete disjunctions. Discrete t-norms. Discrete t-conorms. Discrete uninorms. Discrete negations. For n=7,8, it contains:

Discrete t-norms. Discrete t-conorms. Discrete uninorms. Discrete negations. For n=9,10,11, it contains:

Discrete t-norms. Discrete t-conorms. Discrete negations.
f
UoS Buildings Image Dataset for Computer Vision Algorithms
salford.figshare.com
application/x-gzip
Updated Jan 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ali Alameer; Mazin Al-Mosawy (2025). UoS Buildings Image Dataset for Computer Vision Algorithms [Dataset]. http://doi.org/10.17866/rd.salford.20383155.v2
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.17866/rd.salford.20383155.v2
Dataset updated
Jan 23, 2025
Dataset provided by
University of Salford
Authors
Ali Alameer; Mazin Al-Mosawy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset for this project is represented by photos, photos for the buildings of the University of Salford, these photos are taken by a mobile phone camera from different angels and different distances , even though this task sounds so easy but it encountered some challenges, these challenges are summarized below: 1. Obstacles. a. Fixed or unremovable objects. When taking several photos for a building or a landscape from different angels and directions ,there are some of these angels blocked by a form of a fixed object such as trees and plants, light poles, signs, statues, cabins, bicycle shades, scooter stands, generators/transformers, construction barriers, construction equipment and any other service equipment so it is unavoidable to represent some photos without these objects included, this will raise 3 questions. - will these objects confuse the model/application we intend to create meaning will that obstacle prevent the model/application from identifying the designated building? - Or will the photos be more precise with these objects and provide the capability for the model/application to identify these building with these obstacles included? - How far is the maximum length for detection? In other words, how far will the mobile device with the application be from the building so it could or could not detect the designated building? b. Removable and moving objects. - Any University is crowded with staff and students especially in the rush hours of the day so it is hard for some photos to be taken without a personnel appearing in that photo in a certain time period of the day. But, due to privacy issues and showing respect to that person, these photos are better excluded. - Parked vehicles, trollies and service equipment can be an obstacle and might appear in these images as well as it can block access to some areas which an image from a certain angel cannot be obtained. - Animals, like dogs, cats, birds or even squirrels cannot be avoided in some photos which are entitled to the same questions above.
2. Weather. In a deep learning project, more data means more accuracy and less error, at this stage of our project it was agreed to have 50 photos per building but we can increase the number of photos for more accurate results but due to the limitation of time for this project it was agreed for 50 per building only. these photos were taken on cloudy days and to expand our work on this project (as future works and recommendations). Photos on sunny, rainy, foggy, snowy and any other weather condition days can be included. Even photos in different times of the day can be included such as night, dawn, and sunset times. To provide our designated model with all the possibilities to identify these buildings in all available circumstances.

The selected buildings. It was agreed to select 10 buildings only from the University of Salford buildings for this project with at least 50 images per building, these selected building for this project with the number of images taken are:

Chapman: 74 images

Clifford Whitworth Library: 60 images

Cockcroft: 67 images

Maxwell: 80 images

Media City Campus: 92 images

New Adelphi: 93 images

New Science, Engineering & Environment: 78 images

Newton: 92 images

Sports Centre: 55 images

University House: 60 images Peel building is an important figure of the University of Salford due to its distinct and amazing exterior design but unfortunately it was excluded from the selection due to some maintenance activities at the time of collecting the photos for this project as it is partially covered with scaffolding and a lot of movement by personnel and equipment. If the supervisor suggests that this will be another challenge to include in the project then, it is mandatory to collect its photos. There are many other buildings in the University of Salford and again to expand our project in the future, we can include all the buildings of the University of Salford. The full list of buildings of the university can be reviewed by accessing an interactive map on: www.salford.ac.uk/find-us

Expand Further. This project can be improved furthermore with so many capabilities, again due to the limitation of time given to this project , these improvements can be implemented later as future works. In simple words, this project is to create an application that can display the building’s name when pointing a mobile device with a camera to that building. Future featured to be added: a. Address/ location: this will require collection of additional data which is the longitude and latitude of each building included or the post code which will be the same taking under consideration how close these buildings appear on the interactive map application such as Google maps, Google earth or iMaps. b. Description of the building: what is the building for, by which school is this building occupied? and what facilities are included in this building? c. Interior Images: all the photos at this stage were taken for the exterior of the buildings, will interior photos make an impact on the model/application for example, if the user is inside newton or chapman and opens the application, will the building be identified especially the interior of these buildings have a high level of similarity for the corridors, rooms, halls, and labs? Will the furniture and assets will be as obstacles or identification marks? d. Directions to a specific area/floor inside the building: if the interior images succeed with the model/application, it would be a good idea adding a search option to the model/application so it can guide the user to a specific area showing directions to that area, for example if the user is inside newton building and searches for lab 141 it will direct him to the first floor of the building with an interactive arrow that changes while the user is approaching his destination. Or, if the application can identify the building from its interior, a drop down list will be activated with each floor of this building, for example, if the model/application identifies Newton building, the drop down list will be activated and when pressing on that drop down list it will represent interactive tabs for each floor of the building, selecting one of the floors by clicking on its tab will display the facilities on that floor for example if the user presses on floor 1 tab, another screen will appear displaying which facilities are on that floor. Furthermore, if the model/application identifies another building, it should activate a different number of floors as buildings differ in the number of floors from each other. this feature can be improved with a voice assistant that can direct the user after he applies his search (something similar to the voice assistant in Google maps but applied to the interior of the university’s buildings. e. Top View: if a drone with a camera can be afforded, it can provide arial images and top views for the buildings that can be added to the model/application but these images can be similar to the interior images situation , the buildings can be similar to each other from the top with other obstacles included like water tanks and AC units.

Other Questions:

Will the model/application be reproducible? the presumed answer for this question should be YES, IF, the model/application will be fed with the proper data (images) such as images of restaurants, schools, supermarkets, hospitals, government facilities...etc.
D
COCO-style geographically unbiased image dataset for computer vision...
dataverse.ird.fr
pdf, txt, zip
Updated Jan 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Theophile Bayet; Theophile Bayet (2023). COCO-style geographically unbiased image dataset for computer vision applications [Dataset]. http://doi.org/10.23708/N2UY4C
Explore at:
zip(176316624), zip(218991), pdf(57252), txt(1731), pdf(83345), zip(308454)Available download formats
Unique identifier
https://doi.org/10.23708/N2UY4C
Dataset updated
Jan 13, 2023
Dataset provided by
DataSuds
Authors
Theophile Bayet; Theophile Bayet
License
https://dataverse.ird.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.23708/N2UY4Chttps://dataverse.ird.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.23708/N2UY4C
Time period covered
Jan 1, 2022 - Apr 1, 2022
Description
There are already a lot of datasets linked to computer vision tasks (Imagenet, MS COCO, Pascal VOC, OpenImages, and numerous others), but they all suffer from important bias. One bias of significance for us is the data origin: most datasets are composed of data coming from developed countries. Facing this situation, and the need of data with local context in developing countries, we try here to adapt common data generation process to inclusive data, meaning data drawn from locations and cultural context that are unseen or poorly represented. We chose to replicate MS COCO's data generation process, as it is well documented and easy to implement. Data was collected from January to April 2022 through Flickr platform. This dataset contains the results of our data collection process, as follows : 23 text files containing comma separated URLs for each of the 23 geographic zones identified in the UN M49 norm. These text files are named according to the names of the geographic zones they cover. Annotations for 400 images per geographic zones. Those annotations are COCO-style, and inform on the presence or absence of 91 categories of objects or concepts on the images. They are shared in a JSON format. Licenses for the 400 annotations per geographic zones, based on the original licenses of the data and specified per image. Those licenses are shared under CSV format. A document explaining the objectives and methodology underlying the data collection, also describing the different components of the dataset.
Dataset for: Some Remarks on the R2 for Clustering
wiley.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6124508.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wileyhttps://www.wiley.com/
Authors
Nicola Loperfido; Thaddeus Tarpey
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
An Ontology of Database Course
figshare.com
xml
Updated Aug 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nur W. Rahayu (2023). An Ontology of Database Course [Dataset]. http://doi.org/10.6084/m9.figshare.24046299.v1
Explore at:
xmlAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24046299.v1
Dataset updated
Aug 29, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Nur W. Rahayu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The relational database course forms a crucial part of computing studies. Right now, we design this ontology model for learning and teaching about relational database. This model covers important stuff like data modeling, database systems, SQL data definition, and SQL queries.
P
Dataset for Erasable Itemset Mining
opendata.pku.edu.cn
Updated Nov 19, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peking University Open Research Data Platform (2015). Dataset for Erasable Itemset Mining [Dataset]. http://doi.org/10.18170/DVN/ISHFQX
Explore at:
text/plain; charset=us-ascii(5336007), text/plain; charset=us-ascii(9764947), text/plain; charset=us-ascii(7000387)Available download formats
Unique identifier
https://doi.org/10.18170/DVN/ISHFQX
Dataset updated
Nov 19, 2015
Dataset provided by
Peking University Open Research Data Platform
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
These three artificial datasets are for mining erasable itemset. The definition of erasable itemset is in the following reference papers. Note that the three data sets all include 200 different items. But for each item, we did not give the profit value of it. Users can generate as they require, with normal or randomly distribution.
WiDS data dictionary v2.xlsx
kaggle.com
zip
Updated Feb 13, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VivekSingh (2018). WiDS data dictionary v2.xlsx [Dataset]. https://www.kaggle.com/viveksinghub/wids-data-dictionary-v2xlsx
Explore at:
zip(121001 bytes)Available download formats
Dataset updated
Feb 13, 2018
Authors
VivekSingh
Description
Dataset

This dataset was created by VivekSingh

Released under Data files © Original Authors

Contents
Machine Learning Awards
kaggle.com
zip
Updated Nov 17, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Internet Association (2016). Machine Learning Awards [Dataset]. https://www.kaggle.com/InternetAssociation/machinelearningawards
Explore at:
zip(15806 bytes)Available download formats
Dataset updated
Nov 17, 2016
Dataset authored and provided by
Internet Association
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset captures Kaggle machine learning competitions over time by project type, host-organization classification, and host-organization headquartered states. Data extraction and analysis were done by the Internet Association.

The following variables are included in the dataset:

start_date: Start date of the competition

end_date: End date of the competition

comp_org_conf: Host organization, company, or conference

primary_us_host: Primary host organization or company if the competition is sponsored by a conference or multiple hosts.

host_type: Private, nonprofit, or government

NAICS_code: 6 digit NAICS classification

NAICS: Definition of the 6 digit NAICS classification

hq_in_us: 1 - Yes, primary host is headquartered in US. 0 - No, host is not headquartered in US.

hq: Headquartered state of primary host

two_digit_definition: First 2 digit NAICS definition

three_digit_definition: First 3 digit NAICS definition

project_type: A classification of project based on project description

subtopic: Subtopic of the project type

project_title: Title of the competition

description: A brief description of the competition

prize: Prizes in US dollars

NAICS.link: link to NAICS code

Source: Internet Association. 2016. Machine Learning Awards. District of Columbia: Internet Association [producer]. Washington, DC: Internet Association. San Francisco, CA: Kaggle [distributor]. Web. 4 November 2016.
Z
Research Data Services in US Higher Education: A Web-Based Inventory - Raw...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Nov 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Radecki, Jane; Springer, Rebecca (2020). Research Data Services in US Higher Education: A Web-Based Inventory - Raw Data [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4270331
Explore at:
Dataset updated
Nov 18, 2020
Dataset provided by
Ithaka S+R
Authors
Radecki, Jane; Springer, Rebecca
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset represents an inventory of research data services at 120 US colleges and universities. The data was collected using a systematic web content analysis process in late 2019. This dataset underlies the following report: Jane Radecki and Rebecca Springer, "Research Data Services in US Higher Education: A Web-Based Inventory," Ithaka S+R, Nov. 2020, https://doi.org/10.18665/sr.314397.

We defined research data services as any concrete, programmatic offering intended to support researchers (including faculty, postdoctoral researchers, and graduate students) in working with data, and identified services within the following campus units: library, IT department/research computing, independent research centers and facilities, academic departments, medical school, business school, and other professional schools. We also recorded whether the institution offered local high performance computing facilities. For detailed definitions, exclusions, and data collection procedures, please see the report referenced above.
n
Amazon Web Services Public Data Sets
neuinfo.org
dknet.org
+1more
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amazon Web Services Public Data Sets [Dataset]. http://identifiers.org/RRID:SCR_006318
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006318
Description
A multidisciplinary repository of public data sets such as the Human Genome and US Census data that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community. Anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. If you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request and the AWS team will review your submission and get back to you. Typically the data sets in the repository are between 1 GB to 1 TB in size (based on the Amazon EBS volume limit), but they can work with you to host larger data sets as well. You must have the right to make the data freely available.
d
Replication Data for: \"A Topic-based Segmentation Model for Identifying...
search.dataone.org
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets. First, we provide a R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provided a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file. [A guide on how to use the code to reproduce each study in the paper] 1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes 3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study. 3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours. 4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study. 4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. you will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours. [Guidelines for running Benchmark models in Table 6] Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics(e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels'package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/). Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction. Aggregate regression: 'lm' default function in R. Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment. Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo(2012)'s package. Run the Kim et al's model (2012) with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home 5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours. [A list of the versions of R, packages, and computer...
2025 Kaggle Machine Learning & Data Science Survey
kaggle.com
Updated Jan 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 28, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Hina Ismail
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.

This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

Challenge This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community..

In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

Submissions will be evaluated on the following:

Composition - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations. Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time. Documentation - Are your code, and kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

Timeline All dates are 11:59PM UTC

Submission deadline: December 3rd

Winners announced: December 10th

Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

All kernels are evaluated after the deadline.

Rules To be eligible to win a prize in either of the above prize tracks, you must be:

a registered account holder at Kaggle.com; the older of 18 years old or the age of majority in your jurisdiction of residence; and not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

Survey Methodology ...
n
Dataset of Pairs of an Image and Tags for Cataloging Image-based Records
narcis.nl
data.mendeley.com
Updated Apr 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
Explore at:
Unique identifier
https://doi.org/10.17632/msyc6mzvhg.2
Dataset updated
Apr 19, 2022
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Suzuki, T (via Mendeley Data)
Description
Brief ExplanationThis dataset is created to develop and evaluate a cataloging system which assigns appropriate metadata to an image record for database management in digital libraries. That is assumed for evaluating a task, in which given an image and assigned tags, an appropriate Wikipedia page is selected for each of the given tags.A main characteristic of the dataset is including ambiguous tags. Thus, visual contents of images are not unique to their tags. For example, it includes a tag 'mouse' which has double meaning of not a mammal but a computer controller device. The annotations are corresponding Wikipedia articles for tags as correct entities by human judgement.The dataset offers both data and programs that reproduce experiments of the above-mentioned task. Its data consist of sources of images and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total 2,464 relevant Wikipedia pages manually judged for tags of the images. The dataset also provides programs in Jupiter notebook (scripts.ipynb) to conduct a series of experiments running some baseline methods for the designated task and evaluating the results. ## Structure of the Dataset1. data directory 1.1. image_URL.txt This file lists URLs of image files. 1.2. rels.txt This file lists collect Wikipedia pages for each topic in topics.txt 1.3. topics.txt This file lists a target pair, which is called a topic in this dataset, of an image and a tag to be disambiguated. 1.4. enwiki_20171001.xml This file is extracted texts from the title and body parts of English Wikipedia articles as of 1st October 2017. This is a modified data of Wikipedia dump data (https://archive.org/download/enwiki-20171001).2. img directory This directory is a placeholder directory to fetch image files for downloading.3. results directory This directory is a placeholder directory to store results files for evaluation. It maintains three results of baseline methods in sub-directories. They contain json files each of which is a result of one topic, and are ready to be evaluated using an evaluation scripts in scripts.ipynb for reference of both usage and performance. 4. scripts.ipynb The scripts for running baseline methods and evaluation are ready in this Jupyter notebook file.
T
Overdose Data (CFD)
data.cincinnati-oh.gov
csv, xlsx, xml
Updated Dec 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Cincinnati (2025). Overdose Data (CFD) [Dataset]. https://data.cincinnati-oh.gov/Safety/Overdose-Data-CFD-/n6qn-tghq
Explore at:
csv, xml, xlsxAvailable download formats
Dataset updated
Dec 2, 2025
Dataset authored and provided by
City of Cincinnati
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
Data Description: Fire Incident data includes all fire incident responses. This includes emergency medical services (EMS) calls, fires, rescue incidents, and all other services handled by the Fire Department. All runs are coded according to classification: for EMS, this includes ALS (advanced life support); BLS (basic life support); etc.

Data Creation: This data is created when a run is entered into the City of Cincinnati’s computer-aided dispatch (CAD) database.

Data Created By: The source of this data is the City of Cincinnati's computer aided dispatch (CAD) database.

Refresh Frequency: This data is updated daily.

CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights in addition to our Open Data in an effort to increase access and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/6jrc-cmn5

Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.

Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit the Office of Performance and Data Analytics facilitates standard processing to most raw data prior to publication. Processing includes but is not limited: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e. Census, neighborhoods, police districts, etc.).

Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad

Disclaimer: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
Data from: Serial Speakers: a Dataset of TV Series
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xavier Bost; Vincent Labatut; Georges Linarès (2023). Serial Speakers: a Dataset of TV Series [Dataset]. http://doi.org/10.6084/m9.figshare.3471839.v11
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3471839.v11
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Xavier Bost; Vincent Labatut; Georges Linarès
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset of three TV Series with manual annotations.Cite as:@inproceedings{Bost2020, title = {Serial Speakers: a Dataset of TV Series}, author = {Bost, Xavier and Labatut, Vincent and Linares, Georges}, url = {https://hal.archives-ouvertes.fr/hal-02477736}, booktitle = {12th International Conference on Language Resources and Evaluation (LREC 2020)}, address = {Marseille, France}, year = {2020}} The dataset consists of 3 TV series: - Breaking Bad: S01--S05 (file 'bb.json')- Game of Thrones: S01--08 (file 'got.json')- House of Cards: S01--S02 (file 'hoc.json')All three files are in .json format and contain TV Series annotated data.Each TV Series is defined by its name,A TV Series contains seasons, defined by their ids.Every season is made of episodes, defined by their ids, titles, duration and fps .Each episode contains two basic kinds of data: scenes and speech segments.Scenes are defined by starting points and are made of shots (Seasons 1 only).A shot is defined by:- Starting and ending positions.- Recurring shot ids.The speech segments are defined by their:- Starting and ending points.- Textual content (here encrypted for copyright reasons).- Speaker.- Possible interlocutors (for the following episodes only: bb: S01E04, S01E06, S02E03, S02E04; got: S01E03, S01E07, S01E08; hoc: S01E01, S01E07, S01E11).All timestamps are expressed in seconds and are valid for the video files extracted from the commercial DVDs (PAL 25 FPS), with recaps (unannotated) included at the beginning of the House of Cards episodes.In you are interested in the textual content of the dataset, please consider using our text recovering tool on GitHub:https://github.com/bostxavier/Serial-SpeakersA comprehensive description of the dataset can be found at:https://hal.archives-ouvertes.fr/hal-02477736

Facebook

Twitter

Click to copy link

Link copied

Cite

Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data

GitHub Programming Languages Data

Statistics for Programming Languages used on GitHub

Explore at:

zip(41198 bytes)Available download formats

Dataset updated

Jan 2, 2022

Authors

Isaac Wen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Context

A common question for those new and familiar to computer science and software engineering is what is the most best and/or most popular programming language. It is very difficult to give a definitive answer, as there are a seemingly indefinite number of metrics that can define the 'best' or 'most popular' programming language.

One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages that are used for repositories, PRs, and issues on GitHub and be a good indicator for the popularity of a language.

Content

This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

Source

This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

Limitations

Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.

Clear search

Close search

Google apps

Main menu

GitHub Programming Languages Data

Context

Content

Source

Limitations

Data from: A Generic Local Algorithm for Mining Data Streams in Large...

LScD (Leicester Scientific Dictionary)

English Wikipedia People Dataset

Summary

Data Structure

Stats

Maintenance and Support

Initial Data Collection and Normalization

Who are the source language producers?

Attribution

Data Dictionary.xlsx

Dataset

Contents

Data from: Discrete Dataset - A dataset for computing with operators defined...

UoS Buildings Image Dataset for Computer Vision Algorithms

COCO-style geographically unbiased image dataset for computer vision...

Dataset for: Some Remarks on the R2 for Clustering

An Ontology of Database Course

Dataset for Erasable Itemset Mining

WiDS data dictionary v2.xlsx

Dataset

Contents

Machine Learning Awards

Research Data Services in US Higher Education: A Web-Based Inventory - Raw...

Amazon Web Services Public Data Sets

Replication Data for: \"A Topic-based Segmentation Model for Identifying...

2025 Kaggle Machine Learning & Data Science Survey

Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

Overdose Data (CFD)

Data from: Serial Speakers: a Dataset of TV Series

GitHub Programming Languages Data

Statistics for Programming Languages used on GitHub

Context

Content

Source

Limitations