75 datasets found
  1. Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jun 7, 2021
    Cite
    Mariana O. Silva; Laís Mota; Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. http://doi.org/10.5281/zenodo.4904639
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariana O. Silva; Laís Mota; Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS, and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical feature sources. Data from all three categories were initially collected between January and May 2019, and the data were then updated and enhanced in June 2019.

    The attractive features of MusicOSet include:

    • Integration and centralization of different musical data sources
    • Calculation of popularity scores and classification of musical elements as hits or non-hits, covering 1962 to 2018
    • Enriched metadata for music, artists, and albums from the US popular music industry
    • Availability of acoustic and lyrical resources
    • Unrestricted access in two formats: SQL database and compressed .csv files
    | Data              | # Records |
    |:-----------------:|:---------:|
    | Songs             | 20,405    |
    | Artists           | 11,518    |
    | Albums            | 26,522    |
    | Lyrics            | 19,664    |
    | Acoustic Features | 20,405    |
    | Genres            | 1,561     |
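
    As a quick-start illustration of the compressed .csv access route, a minimal pandas sketch (file names, the separator, and the presence of these columns are assumptions about the archive layout, not documented above):

    ```python
    import pandas as pd

    # Minimal sketch: load two MusicOSet tables from the extracted CSV release.
    # File names and the separator are assumptions; inspect the archive first.
    songs = pd.read_csv("musicoset_metadata/songs.csv", sep="\t")
    artists = pd.read_csv("musicoset_metadata/artists.csv", sep="\t")

    print(len(songs), "songs;", len(artists), "artists")
    print(songs.columns.tolist())  # check which popularity/hit columns are present
    ```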
  2. EMRBots: a 10,000-patient database

    • figshare.com
    zip
    Updated Sep 3, 2018
    Cite
    Uri Kartoun (2018). EMRBots: a 10,000-patient database [Dataset]. http://doi.org/10.6084/m9.figshare.7040060.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 3, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Uri Kartoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A database of 10,000 virtual patients, comprising 36,143 admissions and 10,726,505 lab observations in total.

  3. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    Prepared in April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied to the Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts, and is made available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]; the full list is presented in the file 'List_of_Categories.txt'
    5. Research Areas: one or more research areas from the list of research areas [3]; the full list is presented in the file 'List_of_Research_Areas.txt'
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading the Data Online

    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset into R

    The LSC was collected as TXT files. All documents are imported into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories are removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates section headings with the first word of the section, producing words such as 'ConclusionHigher' and 'ConclusionsRT'. Such words were detected by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words; for instance, 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. The section headings found in such abstracts are listed below:

    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Abstract Length

    After correction, the lengths of abstracts are calculated. 'Length' is the total number of words in the text, counted by the same rule as Microsoft Word's word count [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we limited abstract length to between 30 and 500 words in order to study documents with abstracts of typical length, and to avoid length effects in the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Conferences and journals can append a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Abstract Length

    The cleaning procedure described in the previous step left some abstracts below our minimum length criterion (30 words); 474 such texts were removed.

    Step 8: Saving the Dataset in CSV Format

    Documents are saved into 34 CSV files, with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] American Psychological Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
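
    The heading-splitting idea in Step 4 can be illustrated with a short sketch. The corpus itself was processed in R; this Python version, with a truncated heading list, is only an assumption-laden illustration:

    ```python
    import re

    # Illustrative sketch of Step 4: split a section heading that the extraction
    # tool concatenated onto the first word of the section ("ConclusionHigher").
    # Only a few of the headings listed above are included here.
    HEADINGS = ["Background", "Introduction", "Methods", "Results",
                "Discussion", "Conclusions", "Conclusion"]

    def split_concatenated_headings(text: str) -> str:
        for heading in HEADINGS:
            # Add a space whenever a heading runs directly into a capitalised word.
            text = re.sub(rf"\b{heading}(?=[A-Z])", heading + " ", text)
        return text

    print(split_concatenated_headings("ConclusionHigher scores were observed."))
    # -> "Conclusion Higher scores were observed."
    ```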

  4. United States No of Job Postings: New: Mining

    • ceicdata.com
    Updated Dec 20, 2022
    Cite
    CEICdata.com (2022). United States No of Job Postings: New: Mining [Dataset]. https://www.ceicdata.com/en/united-states/number-of-job-postings-new-by-industry/no-of-job-postings-new-mining
    Explore at:
    Dataset updated
    Dec 20, 2022
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 30, 2024 - Mar 17, 2025
    Area covered
    United States
    Description

    United States Number of Job Postings: New: Mining data was reported at 3,770.000 Unit on 05 May 2025. This records a decrease from the previous figure of 4,432.000 Unit for 28 Apr 2025. The data is updated weekly, averaging 1,112.000 Unit from Jan 2008 (Median) to 05 May 2025, with 905 observations. The data reached an all-time high of 15,942.000 Unit on 16 May 2022 and a record low of 75.000 Unit on 28 Apr 2008. The series remains in active status in CEIC and is reported by Revelio Labs, Inc. The data is categorized under Global Database's United States – Table US.RL.JP: Number of Job Postings: New: by Industry.

  5. Source Data for Manuscript: Identifying genomic data use with the Data...

    • zenodo.org
    bin, tsv, zip
    Updated Aug 19, 2024
    Cite
    Charles Parker; Neil Byers; Chris Beecroft; Kjiersten Fagnan; George Garrity; Hugh Salamon; TBK Reddy (2024). Source Data for Manuscript: Identifying genomic data use with the Data Citation Explorer [Dataset]. http://doi.org/10.5281/zenodo.12802877
    Explore at:
    Available download formats: zip, bin, tsv
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    US DOE Joint Genome Institute
    Authors
    Charles Parker; Neil Byers; Chris Beecroft; Kjiersten Fagnan; George Garrity; Hugh Salamon; TBK Reddy
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    May 31, 2020
    Description

    This page contains the source data for the manuscript describing the Data Citation Explorer, currently in review for publication. The preprint version can be found on this page.

    Files:

    DCE_manual_eval_sample.xlsx:

    This file was used to manually evaluate hits generated by the Data Citation Explorer. There are two separate sheets: one with publications returned by searches in PubMed and PubMed Central and another with publications returned by searches in Dimensions. Column descriptions can be found in the file itself. Each row in each evaluation sheet refers to a pair between a JAMO record and a linked publication.

    DCE_citation_report.tsv:

    Contains JAMO record IDs and PubMed IDs from the initial 2020 DCE trial run. There are 238,994 unique JAMO IDs and 25,007 unique PubMed IDs. 76,511 JAMO records are linked with publications.

    Columns:

    • jamo_id - unique JAMO record ID
    • citation_count - Number of citations associated with each record
    • citations - comma-delimited PubMed IDs for linked publications

    DCE_source_files.zip:

    For each JAMO record listed in DCE_citation_report.tsv, this archive provides three files:

    1. JAMO_ID_source.yaml - The fields extracted from the JAMO record that were relevant to the citation search, including any previously known PMIDs (manually curated).
    2. JAMO_ID_expand.yaml - The source record augmented with additional metadata discovered in other resources, including the citations that were discovered based on querying PubMed Central for the values in those metadata fields.
    3. JAMO_ID_audit.json - The audit path as a directed acyclic graph, in JSON.

    Of the ~238k JAMO records submitted to DCE, 6,979 contained anomalous fields that caused the records to be rejected for processing. This list is provided as NOT_PROCESSED.txt. Any records that were not processed are represented as zero-length files in the archive.
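
    A minimal sketch of working with the citation report described above, assuming the TSV carries a header row with the documented column names:

    ```python
    import pandas as pd

    # Minimal sketch: summarise DCE_citation_report.tsv using the documented columns.
    report = pd.read_csv("DCE_citation_report.tsv", sep="\t")

    linked = report[report["citation_count"] > 0]
    print(report["jamo_id"].nunique(), "unique JAMO IDs;",
          len(linked), "records linked to at least one publication")

    # 'citations' holds comma-delimited PubMed IDs; explode to count unique PMIDs.
    pmids = linked["citations"].dropna().str.split(",").explode().str.strip()
    print(pmids.nunique(), "unique PubMed IDs")
    ```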

  6. Mexico Production: Silver: Sinaloa

    • ceicdata.com
    Updated Jan 15, 2025
    Cite
    CEICdata.com (2025). Mexico Production: Silver: Sinaloa [Dataset]. https://www.ceicdata.com/en/mexico/mining-production-by-region/production-silver-sinaloa
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2018 - Jan 1, 2019
    Area covered
    Mexico
    Variables measured
    Industrial Production
    Description

    Mexico Production: Silver: Sinaloa data was reported at 2,401.000 kg in Jan 2019. This records a decrease from the previous figure of 2,407.000 kg for Dec 2018. The data is updated monthly, averaging 4,092.000 kg from Jan 1995 (Median) to Jan 2019, with 289 observations. The data reached an all-time high of 8,544.000 kg in Jun 1996 and a record low of 43.000 kg in May 2004. The series remains in active status in CEIC and is reported by the National Institute of Statistics and Geography. The data is categorized under Global Database's Mexico – Table MX.B024: Mining Production: by Region.

  7. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 27, 2022
    Cite
    Keshavarz, Hossein (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Nagappan, Meiyappan
    Keshavarz, Hossein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.

    The datasets are available under the dataset directory. There are four datasets in this directory:

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
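
    As a loose sketch of how these splits are intended to be used (fit on the balanced training split, evaluate on the unbalanced test split); the commit_id identifier column is a guess, so read the actual feature columns from the CSV headers:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Minimal sketch: fit on the balanced training split, evaluate on the
    # unbalanced large test split. "commit_id" is an assumed identifier column;
    # "buggy" is the documented label column.
    train = pd.read_csv("dataset/apachejit_train.csv")
    test = pd.read_csv("dataset/apachejit_test_large.csv")

    features = [c for c in train.columns if c not in ("commit_id", "buggy")]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(train[features], train["buggy"])

    probs = model.predict_proba(test[features])[:, 1]
    print("AUC on unbalanced test set:", round(roc_auc_score(test["buggy"], probs), 3))
    ```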

    In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, a list of required packages is provided in requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found here.

    The scripts consist of Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub search API and for collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies four filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree: Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014. 313-324.
    2. PyDriller: https://pydriller.readthedocs.io/en/latest/
       Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.

  8. Reddit World News Post Analytics

    • opendatabay.com
    Updated Jul 8, 2025
    Cite
    Datasimple (2025). Reddit World News Post Analytics [Dataset]. https://www.opendatabay.com/data/web-social/4f3e6b7d-569e-48b5-b3e8-6818eb389988
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    World, Data Science and Analytics
    Description

    This dataset provides insight into how public opinion shapes the world news cycle, offering public opinion engagement data from posts on the r/worldnews subreddit. It gathers posts on various topics such as politics, current affairs, socio-economic issues, sports, and entertainment. The dataset includes engagement metrics for each post, allowing for analysis of public sentiment. It is a valuable tool for assessing discussion threads, delving into individual posts to understand prevalent perspectives on world news, and analysing how stories on foreign policy, environmental action, and social movements influence our global outlook.

    Columns

    The worldnews.csv dataset includes the following columns:

    • title: The title of the post. (String)
    • score: The number of upvotes the post has received. (Integer)
    • id: A unique identifier for the post. (String)
    • url: The URL of the post. (String)
    • comms_num: The number of comments the post has received. (Integer)
    • created: The date and time the post was created. (Datetime)
    • body: The main text content of the post. (String)
    • timestamp: The date and time the post was last updated. (Datetime)
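
    A minimal sketch of reading the file and ranking posts by engagement (the combined metric is an assumption, not part of the dataset):

    ```python
    import pandas as pd

    # Minimal sketch: load worldnews.csv and rank posts by a simple
    # engagement measure (upvotes + comments; this metric is an assumption).
    df = pd.read_csv("worldnews.csv")

    df["engagement"] = df["score"] + df["comms_num"]
    top = df.sort_values("engagement", ascending=False).head(10)
    print(top[["title", "score", "comms_num", "created"]])
    ```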

    Distribution

    The dataset is provided in CSV format. It contains 1,871 unique post IDs. While a total row count for the entire dataset is not explicitly stated, data is available in various ranges for scores, comments, and timestamps, indicating a substantial collection of records. For instance, timestamps span from 8th December 2022 to 15th December 2022.

    Usage

    This dataset is ideal for: * Understanding the most popular topics on world news by correlating post engagement with their subject matter. * Analysing differences in post engagement across various geographic regions to identify trending global issues. * Tracking changes in public opinion by monitoring engagement over time, particularly concerning specific news cycles or events. * Conducting deep dives into individual posts to ascertain which perspectives on world news gain the most traction. * Analysing how global stories, from foreign policy to environmental action and social movements, shape collective global outlook.

    Coverage

    The dataset offers global coverage of public opinion, as it is sourced from the r/worldnews subreddit. The time range for the included posts spans from 8th December 2022 to 15th December 2022. The scope primarily focuses on posts related to general world news, politics, current affairs, and socio-economic issues.

    License

    CC0

    Who Can Use It

    This dataset is well-suited for data science and analytics professionals, researchers, and anyone interested in: * Analysing public sentiment related to world events. * Studying the dynamics of online news consumption and engagement. * Exploring the relationship between social media discussions and global outlook. * Developing Natural Language Processing (NLP) models for text analysis and sentiment detection.

    Dataset Name Suggestions

    • Reddit World News Engagement Data
    • Global Public Opinion on News
    • r/worldnews Submission & Comment Data
    • World News Social Sentiment
    • Reddit World News Post Analytics

    Attributes

    Original Data Source: Reddit: /r/worldnews (Submissions & Comments)

  9. BuzzCity mobile advertisement dataset

    • researchdata.smu.edu.sg
    • smu.edu.sg
    bin
    Updated May 30, 2023
    Cite
    Living Analytics Research Centre (2023). BuzzCity mobile advertisement dataset [Dataset]. http://doi.org/10.25440/smu.12062703.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This competition involves advertisement data provided by BuzzCity Pte. Ltd. BuzzCity is a global mobile advertising network that reaches millions of consumers around the world on mobile phones and devices. In Q1 2012, over 45 billion ad banners were delivered across the BuzzCity network, which consists of more than 10,000 publisher sites reaching an average of over 300 million unique users per month. The number of smartphones active on the network has also grown significantly; smartphones now account for more than 32% of the phones that are served advertisements across the BuzzCity network. The "raw" data used in this competition is of two types, a publisher database and a click database, both provided in CSV format. The publisher database records the publisher's (aka partner's) profile and comprises several fields:

    • publisherid - Unique identifier of a publisher
    • Bankaccount - Bank account associated with a publisher (may be empty)
    • address - Mailing address of a publisher (obfuscated; may be empty)
    • status - Label of a publisher, which can be one of the following:
      • "OK" - Publishers whom BuzzCity deems as having healthy traffic (or those who slipped their detection mechanisms)
      • "Observation" - Publishers who may have just started their traffic or whose traffic statistics deviate from the system-wide average; BuzzCity does not yet have any conclusive stand on these publishers
      • "Fraud" - Publishers who are deemed fraudulent with clear proof; BuzzCity suspends their accounts and their earnings are not paid

    On the other hand, the click database records the click traffic and has several fields:

    • id - Unique identifier of a particular click
    • numericip - Public IP address of a clicker/visitor
    • deviceua - Phone model used by a clicker/visitor
    • publisherid - Unique identifier of a publisher
    • adscampaignid - Unique identifier of a given advertisement campaign
    • usercountry - Country from which the surfer clicks
    • clicktime - Timestamp of a given click (in YYYY-MM-DD format)
    • publisherchannel - Publisher's channel type, which can be one of the following: ad - adult sites; co - community; es - entertainment and lifestyle; gd - glamour and dating; in - information; mc - mobile content; pp - premium portal; se - search, portal, services
    • referredurl - URL where the ad banners were clicked (obfuscated; may be empty). More details about the HTTP Referer protocol can be found in this article.

    Related Publication: R. J. Oentaryo, E.-P. Lim, M. Finegold, D. Lo, F.-D. Zhu, C. Phua, E.-Y. Cheu, G.-E. Yap, K. Sim, M. N. Nguyen, K. Perera, B. Neupane, M. Faisal, Z.-Y. Aung, W. L. Woon, W. Chen, D. Patel, and D. Berrar. (2014). Detecting click fraud in online advertising: A data mining approach. Journal of Machine Learning Research, 15, 99-140.
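
    A minimal sketch of joining the two databases to compare click volumes across publisher status labels (file names are assumptions; field names follow the descriptions above):

    ```python
    import pandas as pd

    # Minimal sketch: count clicks per publisher, then compare distributions
    # across the OK / Observation / Fraud status labels.
    clicks = pd.read_csv("clicks.csv")          # click database (assumed file name)
    publishers = pd.read_csv("publishers.csv")  # publisher database (assumed file name)

    per_pub = clicks.groupby("publisherid").size().rename("n_clicks").reset_index()
    merged = per_pub.merge(publishers[["publisherid", "status"]], on="publisherid")
    print(merged.groupby("status")["n_clicks"].describe())
    ```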

  10. Data from: Towards open data blockchain analytics: a Bitcoin perspective

    • search.dataone.org
    • datadryad.org
    Updated Jun 12, 2025
    Cite
    Dan McGinn; Douglas McIlwraith; Yike Guo (2025). Towards open data blockchain analytics: a Bitcoin perspective [Dataset]. http://doi.org/10.5061/dryad.h9r0p65
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Dan McGinn; Douglas McIlwraith; Yike Guo
    Time period covered
    Jul 9, 2018
    Description

    Bitcoin is the first implementation of a technology that has become known as a 'public permissionless' blockchain. Such systems allow public read/write access to an append-only blockchain database without the need for any mediating central authority. Instead they guarantee access, security and protocol conformity through an elegant combination of cryptographic assurances and game theoretic economic incentives. Not until the advent of the Bitcoin blockchain has such a trusted, transparent, comprehensive and granular data set of digital economic behaviours been available for public network analysis. In this article, by translating the cumbersome binary data structure of the Bitcoin blockchain into a high fidelity graph model, we demonstrate through various analyses the often overlooked social and econometric benefits of employing such a novel open data architecture. Specifically we show (a) how repeated patterns of transaction behaviours can be revealed to link user activity across t...

  11. Data from: The geometric blueprint of perovskites

    • archive.materialscloud.org
    • materialscloud-archive-failover.cineca.it
    csv, pdf, text/markdown
    Updated Sep 3, 2018
    Cite
    Marina R. Filip; Feliciano Giustino (2018). The geometric blueprint of perovskites [Dataset]. http://doi.org/10.24435/materialscloud:2018.0012/v1
    Explore at:
    Available download formats: csv, pdf, text/markdown
    Dataset updated
    Sep 3, 2018
    Dataset provided by
    Materials Cloud
    Authors
    Marina R. Filip; Feliciano Giustino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perovskite minerals form an essential component of the Earth's mantle, and synthetic crystals are ubiquitous in electronics, photonics, and energy technology. The extraordinary chemical diversity of these crystals raises the question of how many and which perovskites are yet to be discovered. Here we show that the "no-rattling" principle postulated by Goldschmidt in 1926, describing the geometric conditions under which a perovskite can form, is much more effective than previously thought and allows us to predict perovskites with a fidelity of 80%. By supplementing this principle with inferential statistics and internet data mining we establish that currently known perovskites are only the tip of the iceberg, and we enumerate 90,000 hitherto-unknown compounds awaiting study. Our results suggest that geometric blueprints may enable the systematic screening of millions of compounds and offer untapped opportunities in structure prediction and materials design.

  12. Coresignal | Employee Data | From the Largest Professional Network | Global...

    • datarade.ai
    .json, .csv
    Cite
    Coresignal, Coresignal | Employee Data | From the Largest Professional Network | Global / 712M+ Records / 5 Years of Historical Data / Updated Daily [Dataset]. https://datarade.ai/data-products/public-resume-data-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Brunei Darussalam, Latvia, Eritrea, Christmas Island, Russian Federation, French Guiana, Bosnia and Herzegovina, Palestine, Réunion, Macao
    Description

    ➡️ You can choose from multiple data formats, delivery frequency options, and delivery methods;

    ➡️ You can select raw or clean and AI-enriched datasets;

    ➡️ Multiple APIs designed for effortless search and enrichment (accessible using a user-friendly self-service tool);

    ➡️ Fresh data: daily updates, easy change tracking with dedicated data fields, and a constant flow of new data;

    ➡️ You get all necessary resources for evaluating our data: a free consultation, a data sample, or free credits for testing our APIs.

    Coresignal's employee data enables you to create and improve innovative data-driven solutions and extract actionable business insights. These datasets are popular among companies from different industries, including HR and sales technology and investment.

    Employee Data use cases:

    ✅ Source best-fit talent for your recruitment needs

    Coresignal's Employee Data can help source the best-fit talent for your recruitment needs by providing the most up-to-date information on qualified candidates globally.

    ✅ Fuel your lead generation pipeline

    Enhance lead generation with 712M+ up-to-date employee records from the largest professional network. Our Employee Data can help you develop a qualified list of potential clients and enrich your own database.

    ✅ Analyze talent for investment opportunities

    Employee Data can help you generate actionable signals and identify new investment opportunities earlier than competitors or perform deeper analysis of companies you're interested in.

    ➡️ Why 400+ data-powered businesses choose Coresignal:

    1. Experienced data provider (in the market since 2016);
    2. Exceptional client service;
    3. Responsible and secure data collection.
  13. Dog Food Data Extracted from Chewy (USA) - 4,500 Records in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Apr 22, 2025
    Cite
    Crawl Feeds (2025). Dog Food Data Extracted from Chewy (USA) - 4,500 Records in CSV Format [Dataset]. https://crawlfeeds.com/datasets/dog-food-data-extracted-from-chewy-usa-4-500-records-in-csv-format
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Apr 22, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Dog Food Data Extracted from Chewy (USA) dataset contains 4,500 detailed records of dog food products sourced from one of the leading pet supply platforms in the United States, Chewy. This dataset is ideal for businesses, researchers, and data analysts who want to explore and analyze the dog food market, including product offerings, pricing strategies, brand diversity, and customer preferences within the USA.

    The dataset includes essential information such as product names, brands, prices, ingredient details, product descriptions, weight options, and availability. Organized in a CSV format for easy integration into analytics tools, this dataset provides valuable insights for those looking to study the pet food market, develop marketing strategies, or train machine learning models.

    Key Features:

    • Record Count: 4,500 dog food product records.
    • Data Fields: Product names, brands, prices, descriptions, ingredients, etc. More fields are listed under the data points section.
    • Format: CSV, easy to import into databases and data analysis tools.
    • Source: Extracted from Chewy’s official USA platform.
    • Geography: Focused on the USA dog food market.

    Use Cases:

    • Market Research: Analyze trends and preferences in the USA dog food market, including popular brands, price ranges, and product availability.
    • E-commerce Analysis: Understand how Chewy presents and prices dog food products, helping businesses compare their own product offerings.
    • Competitor Analysis: Compare different brands and products to develop competitive strategies for dog food businesses.
    • Machine Learning Models: Use the dataset for machine learning tasks such as product recommendation systems, demand forecasting, and price optimization.

  14. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jul 20, 2024
    Cite
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 15, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset:

    N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

    Abstract

    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
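
    A minimal sketch of the two lexicon-based steps named above, VADER for sentiment and TextBlob for subjectivity (the class thresholds shown are the conventional VADER defaults, not necessarily the ones used by the authors):

    ```python
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob

    # Minimal sketch: classify one video title by sentiment and subjectivity.
    title = "Measles outbreak spreads; officials urge vaccination"

    scores = SentimentIntensityAnalyzer().polarity_scores(title)
    sentiment = ("positive" if scores["compound"] >= 0.05
                 else "negative" if scores["compound"] <= -0.05
                 else "neutral")

    subjectivity = TextBlob(title).sentiment.subjectivity  # 0.0 objective .. 1.0 subjective
    print(sentiment, round(scores["compound"], 3), round(subjectivity, 3))
    ```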

  15. Table1_Identifying oral disease variables associated with pneumonia...

    • frontiersin.figshare.com
    docx
    Updated Jul 18, 2024
    Cite
    Neel Shimpi; Ingrid Glurich; Aloksagar Panny; Harshad Hegde; Frank A. Scannapieco; Amit Acharya (2024). Table1_Identifying oral disease variables associated with pneumonia emergence by application of machine learning to integrated medical and dental big data to inform eHealth approaches.docx [Dataset]. http://doi.org/10.3389/fdmed.2022.1005140.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Frontiers
    Authors
    Neel Shimpi; Ingrid Glurich; Aloksagar Panny; Harshad Hegde; Frank A. Scannapieco; Amit Acharya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The objective of this study was to build models that define variables contributing to pneumonia risk by applying supervised Machine Learning (ML) to medical and oral disease data, in order to define key risk variables contributing to pneumonia emergence for any pneumonia/pneumonia subtypes.

    Methods: Retrospective medical and dental data were retrieved from the Marshfield Clinic Health System's data warehouse and the integrated electronic medical-dental health records (iEHR). Retrieved data were preprocessed prior to conducting analyses, including matching of cases to controls by (a) race/ethnicity and (b) a 1:1 case:control ratio. Variables with >30% missing data were excluded from analysis. Datasets were divided into four subsets: (1) All Pneumonia (all cases and controls); (2) community (CAP)/healthcare-associated (HCAP) pneumonias; (3) ventilator-associated (VAP)/hospital-acquired (HAP) pneumonias; and (4) aspiration pneumonia (AP). Performance of five algorithms was compared across the four subsets: Naïve Bayes, Logistic Regression, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Random Forests. Feature (input variable) selection and 10-fold cross-validation were performed on all the datasets. An evaluation set (10%) was extracted from the subsets for further validation. Model performance was evaluated in terms of total accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient, and area under the receiver operating characteristic curve (AUC).

    Results: In total, 6,034 records (cases and controls) met eligibility for inclusion in the main dataset. After feature selection, the variables retained in the subsets were: All Pneumonia (n = 29 variables), CAP-HCAP (n = 26 variables), VAP-HAP (n = 40 variables), and AP (n = 37 variables). Variables retained (n = 22) were common across all four pneumonia subsets. Of these, the number of missing teeth, periodontal status, periodontal pocket depth more than 5 mm, and number of restored teeth contributed to all the subsets and were retained in the model. MLP outperformed the other predictive models for the All Pneumonia, CAP-HCAP, and AP subsets, while SVM outperformed the other models on the VAP-HAP subset.

    Conclusion: This study validates previously described associations between poor oral health and pneumonia. The benefits of an integrated medical-dental record and care delivery environment for modeling pneumonia risk are highlighted. Based on these findings, risk score development could inform referrals and follow-up in integrated healthcare delivery environments and coordinated patient management.
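
    A minimal sketch of the evaluation design described above, comparing the five algorithms with 10-fold cross-validation (synthetic data stands in for the iEHR-derived feature matrix, which is not part of this record):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # Synthetic stand-in for the preprocessed case/control dataset.
    X, y = make_classification(n_samples=600, n_features=29, random_state=0)

    models = {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVM": SVC(),
        "MLP": MLPClassifier(max_iter=2000),
        "Random Forest": RandomForestClassifier(random_state=0),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
    ```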

  16. CYGNSS Level 1 Science Data Record Version 2.1 - Dataset - NASA Open Data...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). CYGNSS Level 1 Science Data Record Version 2.1 - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/cygnss-level-1-science-data-record-version-2-1-c4d25
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This Level 1 (L1) dataset contains the Version 2.1 geo-located Delay Doppler Maps (DDMs) calibrated into Power Received (Watts) and Bistatic Radar Cross Section (BRCS) expressed in units of meters squared from the Delay Doppler Mapping Instrument aboard the CYGNSS satellite constellation. This version supersedes Version 2.0. Other useful scientific and engineering measurement parameters include the DDM of Normalized Bistatic Radar Cross Section (NBRCS), the Delay Doppler Map Average (DDMA) of the NBRCS near the specular reflection point, and the Leading Edge Slope (LES) of the integrated delay waveform. The L1 dataset contains a number of other engineering and science measurement parameters, including sets of quality flags/indicators, error estimates, and bias estimates as well as a variety of orbital, spacecraft/sensor health, timekeeping, and geolocation parameters.

    At most, 8 netCDF data files (each file corresponding to a unique spacecraft in the CYGNSS constellation) are provided each day; under nominal conditions, there are typically 6-8 spacecraft retrieving data each day, but this can be maximized to 8 spacecraft under special circumstances in which higher than normal retrieval frequency is needed (i.e., during tropical storms and/or hurricanes). Latency is approximately 6 days (or better) from the last recorded measurement time.

    The Version 2.1 release represents the second science-quality release. Improvements in the Version 2.1 data release include:

    1. Data is now available when the CYGNSS satellites are rolled away from nadir during orbital high beta-angle periods, resulting in a significant amount of additional data.
    2. Corrections to coordinate frames result in more accurate estimates of receiver antenna gain at the specular point.
    3. Improved calibration for analog-to-digital conversion results in better consistency between CYGNSS satellite measurements at nearly the same location and time.
    4. Improved GPS EIRP and transmit antenna pattern calibration results in significantly reduced PRN-dependence in the observables.
    5. Improved estimation of the location of the specular point within the DDM.
    6. An altitude-dependent scattering area is used to normalize the scattering cross section (v2.0 used a simpler scattering area model that varied with incidence and azimuth angles but not altitude).
    7. Corrections added for noise floor-dependent biases in the scattering cross section and leading edge slope of the delay waveform observed in the v2.0 data.

    Users should also note that the receiver antenna pattern calibration is not applied per-DDM-bin in this v2.1 release.
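
    A minimal sketch of opening one day's file for one spacecraft with xarray (the file name is a placeholder, and variable names such as 'brcs' follow common CYGNSS L1 conventions; check them against the actual product):

    ```python
    import xarray as xr

    # Minimal sketch: inspect one CYGNSS L1 v2.1 netCDF file.
    ds = xr.open_dataset("cygnss_L1_v2.1_example.nc")  # placeholder file name

    print(ds.data_vars)   # look for BRCS, NBRCS, DDMA, LES and quality flags
    brcs = ds["brcs"]     # assumed variable name for the BRCS DDMs (m^2)
    print(brcs.dims, brcs.shape)
    ```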

  17. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    Available download formats: csv
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30,000 books, and then the remaining 22,478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30,000 records and as "Month Day Year" for the rest.

    Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.
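
    A minimal sketch of normalising the two date formats mentioned above (requires pandas 2.x for format="mixed"; the file name is an assumption):

    ```python
    import pandas as pd

    # Minimal sketch: parse publishDate, which mixes mm/dd/yyyy (first 30,000
    # rows) with "Month Day Year" (the rest). format="mixed" infers the format
    # per element; anything unparseable becomes NaT for later inspection.
    books = pd.read_csv("books_1.Best_Books_Ever.csv")  # assumed file name

    books["publishDate"] = pd.to_datetime(books["publishDate"],
                                          format="mixed", errors="coerce")
    print(books["publishDate"].isna().sum(), "rows left unparsed")
    ```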

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- |
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  18. SA Mining and Production Tenement Applications

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    zip
    Updated Dec 14, 2019
    Cite
    Australia (2019). SA Mining and Production Tenement Applications [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/f87ed672-cde6-4526-a79f-8e048183dbd5
    Explore at:
    Available download formats: zip (230863)
    Dataset updated
    Dec 14, 2019
    Dataset provided by
    Australia
    License
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    Location of all tenements issued under the Mining Act, 1971. The types of tenement are:

    Mineral Claim (MCA) - An MC provides an exclusive right for 12 months to prospect for minerals within the claim area (conditions apply).

    Miscellaneous Purposes Licence (MPLA) - An MPL may be granted for any purpose ancillary to the conduct of mining operations, for example an operating plant, drainage or an access road.

    Purpose

    The dataset was developed to record information necessary for the administration of the Mining Act, 1971.

    Use: Used to supply government, industry and the general public with an up-to-date status and extent of mining production tenements throughout the state.

    Dataset History

    Source Data History: Departmental hardcopy tenement records are the primary data source. These records are derived from information supplied by applicants. Applicant information is often schematic, and the scale of the final document is determined by the most appropriate scale maps available at the time. Applicant information and associated records used for this project date back over the last 30 years. The source date is dependent upon the date of application. The dataset contains an accuracy description for each tenement. From time to time, applications are confirmed by field staff using GPS.

    Processing Steps: A majority of tenements were reconciled using existing digital cadastral boundaries (DCDB) and the remainder digitised from the above-mentioned departmental records. Tenement boundaries were mathematically constructed to match the DCDB and existing tenement boundaries (where applicable), whilst other boundaries were wholly constructed mathematically. These procedures have been automated using a series of maintenance procedures to assist with manual editing and boundary construction processes. Once tenements have been constructed, attributes are assigned against the region (polygon) feature class. Additional information is recorded against the boundaries to identify how the line work was captured. Hard copy plots have been produced at various set scales depicting tenements on cadastral or topographic backgrounds. The geometry of the tenement is value-added through a join to the Mining Register database, where only matching records between the geometry of the tenement and the Mining Register records are portrayed.

    Dataset Citation

    "SA Department for Manufacturing, Innovation, Trade, Resources and Energy" (2013) SA Mining and Production Tenement Applications. Bioregional Assessment Source Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/f87ed672-cde6-4526-a79f-8e048183dbd5.

  19. Onto-Design

    • dknet.org
    • scicrunch.org
    Cite
    Onto-Design [Dataset]. http://identifiers.org/RRID:SCR_000601
    Explore at:
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 6, 2023. Many laboratories choose to design and print their own microarrays. At present, choosing the genes to include on a given microarray is a very laborious process requiring a high level of expertise. The Onto-Design database assists designers of custom microarrays by providing the means to select genes based on their experiment. Design custom microarrays based on GO terms of interest. User account required. Platform: Online tool

  20. Data from: An Empirical Study of Activity, Popularity, Size, Testing, and...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Cite
    Aakash Gautam; Saket Vishwasrao; Francisco Servant (2020). An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration [Dataset]. http://doi.org/10.5281/zenodo.439362
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aakash Gautam; Saket Vishwasrao; Francisco Servant
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A good understanding of the practices followed by software development projects can positively impact their success, particularly for attracting talent and on-boarding new members. In this paper, we perform a cluster analysis to classify software projects that follow continuous integration in terms of their activity, popularity, size, testing, and stability. Based on this analysis, we identify and discuss four different groups of repositories with distinct characteristics that separate them from the other groups. With this new understanding, we encourage open source projects to acknowledge and advertise their preferences according to these defining characteristics, so that they can recruit developers who share similar values.
