21 datasets found
  1. A Replication Dataset for Fundamental Frequency Estimation

    • zenodo.org
    • live.european-language-grid.eu
    • +1more
    bin
    Updated Apr 24, 2025
    Cite
    Bastian Bechtold (2025). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. http://doi.org/10.5281/zenodo.3904389
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bastian Bechtold
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
    © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains fundamental frequency tracks estimated by 25 algorithms on six speech corpora mixed with noise from two noise corpora at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, computed against each speech corpus's ground truth, each algorithm's own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, for replicating existing studies on a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available for download, and the results are entirely reproducible, albeit requiring about one year of processor time.

    Included Code and Data

    • ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
      • CMU-ARCTIC (consensus truth) [1]
      • FDA (corpus truth and consensus truth) [2]
      • KEELE (corpus truth and consensus truth) [3]
      • MOCHA-TIMIT (consensus truth) [4]
      • PTDB-TUG (corpus truth and consensus truth) [5]
      • TIMIT (consensus truth) [6]
    • noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
    • synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
    • noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
    • noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures (a rough illustration follows below):
      • Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
      • Fine Pitch Error (FPE), the mean error of grossly correct estimates.
      • High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
      • Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
      • Fine Remaining Bias (FRB), the median error of GREs.
      • True Positive Rate (TPR), the percentage of true positive voicing estimates.
      • False Positive Rate (FPR), the percentage of false positive voicing estimates.
      • False Negative Rate (FNR), the percentage of false negative voicing estimates.
      • F₁, the harmonic mean of precision and recall of the voicing decision.
    • Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to run on a fast 2019 computer, and require at least 32 GB of memory.
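
    As a rough illustration of how two of the simpler measures above are defined, here is a minimal sketch for a single pair of estimated and true pitch tracks. It is not the dataset's own evaluation code; the 20% threshold follows the GPE definition above, and the error convention in the FPE sketch is an assumption.

    import numpy as np

    def gross_pitch_error(estimated, truth):
        """GPE: share of voiced frames whose estimate deviates from the true pitch by more than 20%."""
        estimated, truth = np.asarray(estimated, float), np.asarray(truth, float)
        gross = np.abs(estimated - truth) > 0.2 * truth
        return gross.mean()

    def fine_pitch_error(estimated, truth):
        """FPE: mean error of the grossly correct frames (sign convention assumed)."""
        estimated, truth = np.asarray(estimated, float), np.asarray(truth, float)
        fine = np.abs(estimated - truth) <= 0.2 * truth
        return np.mean(estimated[fine] - truth[fine])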

    References:

    1. John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    2. Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    3. F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    4. Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    5. Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    6. John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    7. Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    8. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    9. Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
    10. Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
    11. Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    12. Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    13. Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.
    14. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    15. Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    16. Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    17. Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    18. Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically

  2. US Means of Transportation to Work Census Data

    • kaggle.com
    Updated Feb 23, 2022
    Cite
    Sagar G (2022). US Means of Transportation to Work Census Data [Dataset]. https://www.kaggle.com/goswamisagard/american-census-survey-b08301-cleaned-csv-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sagar G
    Area covered
    United States
    Description

    The US Census Bureau conducts the American Community Survey (ACS) 1-year and 5-year surveys, which record various demographics and are publicly accessible through APIs. I have called the APIs from a Python environment using the requests library, then cleaned and organized the data into a usable format.

    Data Ingestion and Cleaning:

    ACS data [2011-2019] was accessed using Python via the following API link: https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:* The data was obtained in JSON format by calling the above API, then imported as a Python Pandas DataFrame. The 84 variables returned comprise 21 Estimate values for various metrics, the 21 respective Margin of Error values, and the respective Annotation values for the Estimate and Margin of Error values. This data then went through various cleaning steps in Python, where excess variables were removed and the columns were renamed. Web scraping was carried out to extract the variables' names and replace the codes in the column names of the raw data.
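
    The collection step above can be sketched as follows. This is a minimal illustration, assuming the endpoint shown above is still live and returns the standard Census API JSON layout, in which the first row holds the column names; heavy query volumes may additionally require an API key.

    import requests
    import pandas as pd

    # Fetch one year of ACS 1-year data for table B08301 (means of transportation
    # to work) at the county level, as described above.
    url = "https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*"
    rows = requests.get(url).json()           # first row is the header, the rest are data
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df.shape)                           # one row per county, one column per variable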

    The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, and the results were then merged into a single Python Pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, namely 'StateName' and 'CountyName.' The counties for which no data was available were also removed from the DataFrame. Once the DataFrame was ready, it was separated into two new dataframes to separate state and county data, and exported into '.csv' format.

    Data Source:

    More information about the source of Data can be found at the URL below: US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov https://www.census.gov/data/developers/about.html

    Final Word:

    I hope this data helps you to create something beautiful and awesome. I will be posting a lot more databases shortly, if I get more time between assignments, submissions, and semester projects 🧙🏼‍♂️. Good luck.

  3. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:

    • Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    • Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    • Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    • Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.

  4. Data from: Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric...

    • scholardata.sun.ac.za
    • data.mendeley.com
    Updated Mar 8, 2025
    + more versions
    Cite
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen (2025). Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023) [Dataset]. http://doi.org/10.25413/sun.28554200.v1
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    SUNScholarData
    Authors
    Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nairobi
    Description

    This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:

    • Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
    • Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)

    The dataset is organised into two main categories:

    • Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
    • Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption

    This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.

    Institutions: EED Advisory, Clean Air Taskforce, Stellenbosch University

    Steps to reproduce:

    Raw Data Collection
    • GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
    • Rider-reported information on revenue, maintenance costs, and fuel/electricity usage

    Processing Steps
    • GPS data cleaning: filtered invalid coordinates, removed duplicates, interpolated missing points
    • Trip identification: defined by >1 minute stationary periods or ignition cycles
    • Trip metrics calculation: distance, duration, idle time, average/max speeds
    • Daily data aggregation: summed by user_id and date with self-reported economic data
    • Validation: cross-checked with rider logs and known routes
    • Anonymisation: removed start and end coordinates for the first and last trips of each day to protect rider privacy and home locations

    Technical Information
    • Geographic coverage: Nairobi, Kenya
    • Time period: November-December 2023
    • Time zone: UTC+3 (East Africa Time)
    • Currency: Kenyan Shillings (KES)
    • Data format: CSV files
    • Software used: Python 3.8 (pandas, numpy, geopy)
    • Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.

    Categories: Motorcycle, Transportation in Africa, Electric Vehicles
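
    The trip-identification rule above (a new trip after more than one minute of stationary samples) can be sketched with pandas. This is a hypothetical illustration for a single device's 10-second GPS track; the file and column names (timestamp, speed_kph) are not the dataset's actual schema:

    import pandas as pd

    pts = pd.read_csv("gps_points_one_device.csv", parse_dates=["timestamp"])
    pts = pts.sort_values("timestamp").reset_index(drop=True)

    stationary = pts["speed_kph"] < 1.0            # treat ~0 km/h samples as stationary
    run_id = (~stationary).cumsum()                # consecutive stationary samples share an id
    run_len = stationary.groupby(run_id).cumsum()  # length of the current stationary run
    # a new trip starts on the first moving sample after >= 60 s (6 samples) at rest
    new_trip = (~stationary) & (run_len.shift(fill_value=0) >= 6)
    pts["trip_id"] = new_trip.cumsum()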

  5. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels, and one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database; a minimal loading sketch follows the column list below. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data
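
    A minimal sketch of working with this summary file, using the file name listed above and the documented column names (this is not part of the database itself):

    import pandas as pd

    # Load the overall summary and report the computed yield stress and elastic
    # modulus per material grade.
    summary = pd.read_csv("Overall_Summary_2022-08-25_v1-0-0.csv")
    per_grade = summary.groupby("grade")[["yield_stress_mpa_", "elastic_modulus_mpa_"]].mean()
    print(per_grade)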

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd

    # e.g. date = '2022-08-25' and version = '_v1-0-0', matching the summary file name above
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  6. "9,565 Top-Rated Movies Dataset"

    • kaggle.com
    Updated Aug 19, 2024
    Cite
    Harshit@85 (2024). "9,565 Top-Rated Movies Dataset" [Dataset]. https://www.kaggle.com/datasets/harshit85/9565-top-rated-movies-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Harshit@85
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About the Dataset

    Title: 9,565 Top-Rated Movies Dataset

    Description:
    This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.

    Key Features:
    • Title: The official title of each movie.
    • Overview: A brief synopsis or description of the movie's plot.
    • Release Date: The release date of the movie, formatted as YYYY-MM-DD.
    • Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
    • Vote Average: The average rating of the movie, based on user votes.
    • Vote Count: The total number of votes the movie has received.

    Data Source: The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.

    Data Collection Process (a short sketch follows this list):
    • API Access: Data was retrieved programmatically using TMDb's API.
    • Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset's comprehensiveness.
    • Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
    • Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
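
    A minimal sketch of that paginated collection step, assuming you have your own TMDb API key in the TMDB_API_KEY environment variable (only the first few pages are fetched here for illustration):

    import os
    import requests
    import pandas as pd

    api_key = os.environ["TMDB_API_KEY"]
    records = []
    for page in range(1, 6):  # extend the range to cover all pages
        resp = requests.get(
            "https://api.themoviedb.org/3/movie/top_rated",
            params={"api_key": api_key, "page": page},
        )
        resp.raise_for_status()
        records.extend(resp.json()["results"])

    movies = pd.DataFrame(records).drop_duplicates(subset="id")
    print(movies[["title", "release_date", "popularity", "vote_average", "vote_count"]].head())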

    Potential Uses:
    • Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
    • Recommendation Systems: Build and train models to recommend movies based on user preferences.
    • Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
    • Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.

    Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.

    Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).

    This description provides a clear and detailed overview, helping potential users understand the dataset's content, origin, and potential applications.

  7. Data from: Actionable and Interpretable Fault Localization for Recurring...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 3, 2022
    Cite
    Li, Zeyan (2022). Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6955908
    Explore at:
    Dataset updated
    Aug 3, 2022
    Dataset authored and provided by
    Li, Zeyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the datasets for our ESEC/FSE'22 paper "Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems." In each dataset, graph.yml or graphs/*.yml are the FDGs, metrics.csv contains the metrics, and faults.csv contains the failures (including ground truths). FDG.pkl is a pickle of the FDG object, which contains all of the above data. Note that the pickle files are not compatible across different Python and Pandas versions, so if you cannot load them, simply ignore and delete them; they are only used to speed up data loading.

    See more at https://github.com/NetManAIOps/DejaVu

  8. Singapore Residents dataset

    • kaggle.com
    Updated Aug 28, 2019
    Cite
    Anuj_sahay (2019). Singapore Residents dataset [Dataset]. https://www.kaggle.com/anujsahay112/singapore-residents-dataset/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anuj_sahay
    Area covered
    Singapore
    Description

    Context

    This dataset reflects real-world data science work and how data analysts and data scientists operate.

    Content

    The dataset consists of four columns: Year, Level_1 (ethnic group/gender), Level_2 (age group), and Population.

    Acknowledgements

    I would like to sincerely thank GeoIQ for sharing this dataset with me, along with the tasks. Just having basic knowledge of Pandas, NumPy, and other Python data science libraries is not enough; knowing how to execute tasks and how to preprocess the data before making any prediction is very important. Most datasets on Kaggle are clean and well arranged, but this dataset taught me how real-world data science and analysis works. Every data science beginner should work on this dataset and try to execute the tasks. It will give them good exposure to the real data science world.

    Inspiration

    1. Identify the largest ethnic group in Singapore: their average population growth over the years, and what proportion of the total population they constitute (a pandas sketch for this task follows the list).
    2. Identify the largest age group in Singapore: their average population growth over the years, and what proportion of the total population they constitute.
    3. Identify the group (by age, ethnicity and gender) that: a. has shown the highest growth rate, b. has shown the lowest growth rate, c. has remained the same.
    4. Plot a graph of population trends.
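
    A minimal sketch for the first task, assuming the four columns above (the actual capitalization and level labels in the file may differ) and that Level_1 holds the ethnic group:

    import pandas as pd

    df = pd.read_csv("singapore_residents.csv")   # illustrative file name
    by_group = df.groupby(["Year", "Level_1"])["Population"].sum().unstack("Level_1")
    latest = by_group.iloc[-1]                    # most recent year
    largest = latest.idxmax()
    avg_growth = by_group[largest].pct_change().mean()
    share = latest[largest] / latest.sum()
    print(largest, f"average yearly growth {avg_growth:.1%}", f"population share {share:.1%}")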
  9. image-impeccable

    • huggingface.co
    Updated May 11, 2025
    Cite
    ThinkOnward (2025). image-impeccable [Dataset]. https://huggingface.co/datasets/thinkonward/image-impeccable
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset provided by
    Think Onward LLC
    Authors
    ThinkOnward
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Image Impeccable

    Dataset Description

    This data was produced by ThinkOnward for the Image Impeccable Challenge, using a synthetic seismic dataset generator called Synthoseis.

    Created by: Mike McIntire and Jesse Pisel
    License: CC 4.0

    Uses

    How to generate a dataset

    This dataset is provided as paired noisy and clean seismic volumes. Follow the following steps to load the data into numpy volumes: import pandas as pd, import numpy as… See the full description on the dataset page: https://huggingface.co/datasets/thinkonward/image-impeccable.

  10. Data from: BSRN solar radiation data for the testing, validation and...

    • zenodo.org
    • portaldelainvestigacion.uma.es
    • +1more
    bin
    Updated Feb 11, 2024
    Cite
    Jose A Ruiz-Arias (2024). BSRN solar radiation data for the testing, validation and benchmarking of solar irradiance components separation models [Dataset]. http://doi.org/10.5281/zenodo.10593079
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jose A Ruiz-Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is an excerpt of the validation dataset used in:

    Ruiz-Arias JA, Gueymard CA. Review and performance benchmarking of 1-min solar irradiance components separation methods: The critical role of dynamically-constrained sky conditions. Submitted for publication to Renewable and Sustainable Energy Reviews.

    and it is ready to use in the Python package splitting_models developed during that research. See the documentation in the Python package for usage details. Below, there is a detailed description of the dataset.

    The data is in a single parquet file that contains 1-min time series of solar geometry, clear-sky solar irradiance simulations, solar irradiance observations and CAELUS sky types for 5 BSRN sites, one per primary Köppen-Geiger climate, namely: Minamitorishima (mnm), JP, for equatorial climate; Alice Springs (asp), AU, for dry climate; Carpentras (car), FR, for temperate climate; Bondville (bon), US, for continental climate; and Sonnblick (son), AT, for cold/polar/snow climate. It includes one calendar year per site. The BSRN data is publicly available. See download instructions in https://bsrn.awi.de/data.

    The specific variables included in the dataset are:

    • climate: primary Köppen-Geiger climate. Values are: A (equatorial), B (dry), C (temperate), D (continental) and E (polar/snow).
    • longitude: longitude, in degrees east.
    • latitude: latitude, in degrees north.
    • sza: solar zenith angle, in degrees.
    • eth: extraterrestrial solar irradiance (i.e., top of atmosphere solar irradiance), in W/m2.
    • ghics: clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.
    • difcs: clear-sky diffuse solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.
    • ghicda: clean-and-dry clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere, prescribing zero aerosols and zero precipitable water.
    • ghi: observed global horizontal irradiance, in W/m2.
    • dif: observed diffuse irradiance, in W/m2.
    • sky_type: CAELUS sky type. Values are: 1 (unknown), 2 (overcast), 3 (thick clouds), 4 (scattered clouds), 5 (thin clouds), 6 (cloudless) and 7 (cloud enhancement).

    The dataset can be easily loaded in a Python Pandas DataFrame as follows:

    import pandas as pd

    data = pd.read_parquet(parquet_file)  # parquet_file: path to the dataset's single parquet file

    The dataframe has a multi-index with two levels: times_utc and site. The former are the UTC timestamps at the center of each 1-min interval. The latter is each site's label.
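
    Continuing from the snippet above, a short usage example (a sketch, assuming the index level is indeed named site and the sky_type codes listed above):

    car = data.xs("car", level="site")       # all 1-min records for Carpentras
    cloudless = car[car["sky_type"] == 6]    # CAELUS sky type 6 = cloudless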

  11. Tour Recommendation Model

    • test.researchdata.tuwien.at
    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings (a minimal training sketch follows this description).

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
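
    A minimal sketch of the pipeline described above (80/20 split, Decision Tree Regressor, joblib export). It uses only the two documented columns, encodes the identifier as an integer feature, and is illustrative rather than the authors' actual training code:

    import pandas as pd
    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("user_ratings_dataset.csv").dropna(subset=["rating"])
    df["place_code"] = pd.factorize(df["place_or_event_id"])[0]   # integer-encode the ID
    X, y = df[["place_code"]], df["rating"].astype(float)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = DecisionTreeRegressor(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

    joblib.dump(model, "tour_recommendation_model.pkl")           # file name as listed above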

  12. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy; it’s also about having the right skill set, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with a high level of expectation, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires a comprehensive level of technical preparation. Job seekers are often required to demonstrate their technical skills to show their ability to effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're comfortable with data manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  13. BSRN solar radiation data for the testing, validation and benchmarking of...

    • explore.openaire.eu
    Updated Jan 30, 2024
    Cite
    BSRN solar radiation data for the testing, validation and benchmarking of solar irradiance components separation models [Dataset]. https://explore.openaire.eu/search/dataset?pid=10.5281/zenodo.10593079
    Explore at:
    Dataset updated
    Jan 30, 2024
    Authors
    Jose A Ruiz-Arias
    Description

    The description duplicates that of dataset 10 above (this record points to the same Zenodo dataset, DOI 10.5281/zenodo.10593079); see that entry for the full variable list and loading example.

  14. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures...

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Eng, Kent X. (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Noh, Hae Young
    Liu, Xinyu
    Liu, Jingxiao
    Chen, Xinlei
    Zhang, Lin
    Eng, Kent X.
    Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This S&M-HSTPM2d5 dataset contains high spatial and temporal resolution particulate matter (PM2.5) measurements, with corresponding timestamps and GPS locations of mobile and static devices, in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling interval was one minute in Cangzhou and three seconds in Foshan and Tianjin. For the specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After the data collection process, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script of the data cleaning algorithm is provided in this repository. The data cleaning algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the data cleaning algorithm, based on empirical judgment and graphic visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as CSV files. Each CSV file, named by its device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: the timestamp in China Standard Time (GMT+8), the geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either the Static or Mobile folder, which indicates the device type, and the Static and Mobile folders are stored in the corresponding city's folder.

    To access the dataset, any programming language that can read CSV files is appropriate; users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To successfully execute the ipynb files, Jupyter Notebook with Python 3 is required. The following Python libraries are also required:

    get_dataset.ipynb: 1. os library 2. pandas library

    Data_cleaning_algorithm.ipynb: 1. os library 2. pandas library 3. datetime library 4. math library

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the ipynb files with Jupyter Notebook and follow the instructions inside.
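
    For orientation, here is a minimal sketch that walks one city's folders and concatenates the per-device CSV files, using only the os and pandas libraries listed above; the folder name is an assumption and get_dataset.ipynb remains the reference loader:

    import os
    import pandas as pd

    frames = []
    city_dir = "Foshan"                       # assumed folder name for one city
    for device_type in ("Static", "Mobile"):
        folder = os.path.join(city_dir, device_type)
        for fname in os.listdir(folder):
            if fname.endswith(".csv"):
                df = pd.read_csv(os.path.join(folder, fname))
                df["device_id"] = os.path.splitext(fname)[0]   # file name = device ID
                df["device_type"] = device_type
                frames.append(df)

    city_data = pd.concat(frames, ignore_index=True)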

    For questions or suggestions please e-mail Xinlei Chen

  15. Flipkart OnlineOrders

    • kaggle.com
    Updated Jun 22, 2020
    Cite
    Sabya (2020). Flipkart OnlineOrders [Dataset]. https://www.kaggle.com/sabya40/filpkart-onlineorders/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sabya
    Description

    Context

    This dataset contains 6 months of customer online orders. The data is simple but messy and unorganized. It is intended for beginner- and intermediate-level users who want to improve their skills in Pandas, Matplotlib, and Seaborn.

    Content

    The dataset contains columns like: crawl_timestamp, product_name, product_category_tree, retail_price, discounted_price, and brand.

    The main focus is to clean the dataset and make it organized using pandas.

    Acknowledgements

    I wouldn't be here without the help of data.world. Thank You.

    Inspiration

    I have some questions for this dataset (a pandas sketch for question 1 follows):
    1. What was the best month for sales? How much was earned that month?
    2. What time should we display advertisements to maximize the likelihood of purchases?
    3. Which category sold most in that six-month period?
    4. What were the top 10 products sold in that six-month period?
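
    A minimal sketch for question 1, assuming each row is one ordered item, crawl_timestamp marks the order time, and discounted_price is the amount paid (the file name is illustrative):

    import pandas as pd

    orders = pd.read_csv("flipkart_online_orders.csv", parse_dates=["crawl_timestamp"])
    monthly = (orders
               .assign(month=orders["crawl_timestamp"].dt.to_period("M"))
               .groupby("month")["discounted_price"].sum())
    print("Best month:", monthly.idxmax(), "earned:", round(monthly.max(), 2))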

  16. A dataset of 5 million city trees from 63 US cities: species, location,...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Aug 31, 2022
    + more versions
    Cite
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz (2022). A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more. [Dataset]. http://doi.org/10.5061/dryad.2jm63xsrf
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 31, 2022
    Dataset provided by
    The Biota of North America Program (BONAP)
    Worcester Polytechnic Institute
    Stanford University
    Cornell University
    Harvard University
    Authors
    Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Area covered
    United States
    Description

    Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogenous, rich ecosystems.

    Methods

    See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.

    Data Acquisition

    We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of google results and the top 20 results from Google Datasets Search. If the same named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.

    Data Cleaning

    All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported, some within the greater metropolitan area and others not).

    First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including “excellent,” “good”, “fair”, “poor”, “dead”, and “dead/dying”. Some cities included only three points on this scale (e.g., “good”, “poor”, “dead/dying”) while others included five (e.g., “excellent,” “good”, “fair”, “poor”, “dead”).

    Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date formats, units (we converted all units to metric), address issues, and common name formats. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city; for our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called “location_type” to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

    Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with a threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia, which is not a known tree species name, was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

    Fourth, because some tree inventories reported species by common names only, we converted common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common-to-scientific name pairings, confirming that all were correct, and then programmatically assigned scientific names to all common names (Data S9).

    Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or whether we did not have enough information to determine nativity (for cases where only the genus was known).

    Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

    Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
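    As a rough illustration of the condition-harmonization step described above, the following pandas sketch maps one city's raw condition codes onto the shared descriptive scale and derives a binary version. The column names, the mapping, and the binary cut-point ("excellent"/"good" versus everything else) are hypothetical placeholders rather than the published pipeline; the actual mappings and decision points are recorded in Data S9.

    ```
    import pandas as pd

    # Hypothetical mapping from one city's raw condition codes to the shared
    # descriptive scale used across cities.
    CONDITION_MAP = {
        "1": "excellent", "2": "good", "3": "fair", "4": "poor", "5": "dead",
        "Good": "good", "Fair": "fair", "Poor": "poor", "Dead": "dead",
    }

    def harmonize_condition(df: pd.DataFrame) -> pd.DataFrame:
        """Standardize a city's condition column and add a binary version."""
        df = df.copy()
        df["condition"] = df["condition"].astype(str).str.strip().map(CONDITION_MAP)
        # Illustrative binary split only; the analysis's actual cut-point is
        # described in the Condition and Health section.
        df["condition_binary"] = df["condition"].isin(["excellent", "good"]).astype(int)
        return df
    ```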

  17. yahoo_finance_data_nse_2000_stocks

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stormblessed_Ash (2025). yahoo_finance_data_nse_2000_stocks [Dataset]. https://www.kaggle.com/datasets/ashvinvinodh97/yahoo-finance-data-nse-2000-stocks
    Explore at:
    zip(198144682 bytes)Available download formats
    Dataset updated
    Apr 11, 2025
    Authors
    Stormblessed_Ash
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset contains daily OHLCV data for ~2000 Indian stocks listed on the National Stock Exchange, covering each ticker's full available history. The columns are multi-index columns, so this needs to be taken into account when reading and using the data. Source: Yahoo Finance. Type: all files are in CSV format. Currency: INR.

    All the tickers have been collected from here : https://www.nseindia.com/market-data/securities-available-for-trading

    If using pandas, the following function is a utility to read any of the CSV files:

    ```
    import pandas as pd

    def read_ohlcv(filename):
        """Read a given OHLCV data file downloaded from yfinance."""
        return pd.read_csv(
            filename,
            skiprows=[0, 1, 2],  # remove the multi-index rows that cause trouble
            names=["Date", "Close", "High", "Low", "Open", "Volume"],
            index_col="Date",
            parse_dates=["Date"],
        )

    dataset = read_ohlcv("ABCAPITAL.NS.csv")
    ```
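    As a quick usage check, one might compute daily returns from the Close column; this assumes the column layout produced by the helper above, and the ticker file is just the example from the description.

    ```
    # Simple daily percentage returns as a sanity check on the loaded data.
    prices = read_ohlcv("ABCAPITAL.NS.csv")
    daily_returns = prices["Close"].pct_change().dropna()
    print(daily_returns.describe())
    ```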

  18. Heathrow Weather Data

    • kaggle.com
    Updated Apr 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jonathan Bowden (2021). Heathrow Weather Data [Dataset]. https://www.kaggle.com/datasets/bowdenjr/heathrow-weather-data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 3, 2021
    Dataset provided by
    Kaggle
    Authors
    Jonathan Bowden
    Description

    Context

    Simple time series data for weather prediction projects.

    Content

    The data contains the following information from the UK Met Office location at London Heathrow Airport. The data runs from Jan 1948 to Oct 2020 and includes the following monthly data fields:

    • yyyy = Year
    • mm = Month
    • tmax = Maximum temperature (Celsius)
    • tmin = Minimum temperature (Celsius)
    • af = Count of Air Frost days in the given month
    • rain = Total rainfall (mm)
    • sun = Sunshine duration (hrs)

    Acknowledgements

    Provided by the UK Met Office: https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data
    Available under Open Government Licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

    Example code

    The following Python code will load the data into a Pandas DataFrame:

    import pandas as pd

    colspecs = [(3, 7), (9, 11), (14, 18), (22, 26), (32, 34), (37, 42), (45, 50)]
    data = pd.read_fwf('../input/heathrow-weather-data/heathrowdata.txt', colspecs=colspecs)

    The following will remove the first few lines of header text:

    data = data[3:].reset_index(drop=True)
    data.columns = data.iloc[1]
    data = data[3:].reset_index(drop=True)
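    Continuing from the snippet above, a possible next step is to coerce the fields to numeric types and build a monthly datetime index. This assumes the columns come out named yyyy, mm, tmax, and so on, as in the field list; the exact cleanup needed depends on the quirks of the fixed-width file, so treat it as a sketch rather than part of the official example.

    import pandas as pd

    # Coerce all fields to numbers; any non-numeric markers become NaN.
    data = data.apply(pd.to_numeric, errors='coerce')
    # Build a monthly DatetimeIndex from the year and month columns.
    data.index = pd.to_datetime(dict(year=data['yyyy'], month=data['mm'], day=1))
    monthly_tmax = data['tmax']  # maximum temperature series in Celsius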

  19. Raw voltage and current traces for current-voltage (IV) relationships for...

    • figshare.com
    zip
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Paul Manis; Michael R. Kasten; Ruili Xie (2023). Raw voltage and current traces for current-voltage (IV) relationships for cochlear nucleus neurons. [Dataset]. http://doi.org/10.6084/m9.figshare.8854352.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Paul Manis; Michael R. Kasten; Ruili Xie
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole-cell tight-seal current-clamp recordings from neurons in brain slices of mouse cochlear nucleus. These data are the responses to series of current steps (100 and 500 ms in duration), used to derive measures of intrinsic excitability, including input resistance, resting membrane potential, time constants, spike shape parameters, coefficient of variation of spike rate, and adaptation. The data were analyzed using the package ephysanalysis (https://github.com/pbmanis/ephysanalysis). The raw data here are in NWB format (https://neurodatawithoutborders.github.io/pynwb) and have been extracted from the main dataset. Additional files include the extracted parameters (pickled Pandas database) and the Python source files used for the analysis. See README.md for more details. Source file CN_LDA.py updated, 9/4/2019: minor edits to remove unused statements and update docstrings; no change in results. Preprint: bioRxiv 594713; doi: https://doi.org/10.1101/594713
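    For readers who want to open these files in Python, a minimal sketch using pynwb and pandas might look like the following. The file names are placeholders, and the exact contents of each NWB file (e.g., the names of the acquisition series) should be checked against the README.

    ```
    import pandas as pd
    from pynwb import NWBHDF5IO

    # Open one of the raw NWB recordings (placeholder file name).
    with NWBHDF5IO("example_cell.nwb", "r") as io:
        nwbfile = io.read()
        print(nwbfile.acquisition)  # lists the recorded voltage/current series

    # Load the extracted parameters (pickled Pandas database; placeholder file name).
    params = pd.read_pickle("extracted_parameters.pkl")
    print(params.head())
    ```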

  20. Weather Data, Armagh, N. Ireland

    • kaggle.com
    Updated Apr 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jonathan Bowden (2021). Weather Data, Armagh, N. Ireland [Dataset]. https://www.kaggle.com/bowdenjr/weather-data-armagh-n-ireland/tasks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 3, 2021
    Dataset provided by
    Kaggle
    Authors
    Jonathan Bowden
    Area covered
    Northern Ireland, Ireland, Armagh
    Description

    Context

    Simple time series data for weather prediction projects.

    Content

    The data contains the following information from the UK Met Office location at Armagh, Northern Ireland. The data runs from Jan 1853 to Nov 2020 and includes the following monthly data fields:

    • yyyy = Year
    • mm = Month
    • tmax = Maximum temperature (Celsius)
    • tmin = Minimum temperature (Celsius)
    • af = Count of Air Frost days in the given month
    • rain = Total rainfall (mm)
    • sun = Sunshine duration (hrs)

    Acknowledgements

    Provided by the UK Met Office: https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data
    Available under Open Government Licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

    Example code

    The following Python code will load the data into a Pandas DataFrame:

    import pandas as pd

    colspecs = [(3, 7), (9, 11), (14, 18), (22, 26), (32, 34), (37, 42), (45, 50)]
    # NOTE: this path appears to be copied from the Heathrow example; point it at
    # the Armagh data file included in this dataset instead.
    data = pd.read_fwf('../input/heathrow-weather-data/heathrowdata.txt', colspecs=colspecs)

    The following will remove the first few lines of header text:

    data = data[3:].reset_index(drop=True)
    data.columns = data.iloc[1]
    data = data[3:].reset_index(drop=True)

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bastian Bechtold; Bastian Bechtold (2025). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. http://doi.org/10.5281/zenodo.3904389

A Replication Dataset for Fundamental Frequency Estimation

Explore at:
binAvailable download formats
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Bastian Bechtold; Bastian Bechtold
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.

Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.

Included Code and Data

  • ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
    • CMU-ARCTIC (consensus truth) [1]
    • FDA (corpus truth and consensus truth) [2]
    • KEELE (corpus truth and consensus truth) [3]
    • MOCHA-TIMIT (consensus truth) [4]
    • PTDB-TUG (corpus truth and consensus truth) [5]
    • TIMIT (consensus truth) [6]
  • noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
  • synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
  • noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
  • noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
    • Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    • Fine Pitch Error (FPE), the mean error of grossly correct estimates.
    • High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
    • Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
    • Fine Remaining Bias (FRB), the median error of GREs.
    • True Positive Rate (TPR), the percentage of true positive voicing estimates.
    • False Positive Rate (FPR), the percentage of false positive voicing estimates.
    • False Negative Rate (FNR), the percentage of false negative voicing estimates.
    • F₁, the harmonic mean of precision and recall of the voicing decision.
  • Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
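As an informal illustration of the two most basic measures defined above (GPE and FPE), a small NumPy sketch might look like the following. This is not the evaluation code shipped with the dataset; the array names are placeholders, and the exact error definition (relative error over voiced frames, 20% tolerance) is only assumed here to match the descriptions above.

```
import numpy as np

def gpe_fpe(true_f0, est_f0, tolerance=0.2):
    """Gross and Fine Pitch Error over frames voiced in both truth and estimate."""
    true_f0 = np.asarray(true_f0, dtype=float)
    est_f0 = np.asarray(est_f0, dtype=float)
    rel_error = np.abs(est_f0 - true_f0) / true_f0
    gross = rel_error > tolerance                   # off by more than 20%
    gpe = 100.0 * np.mean(gross)                    # percentage of gross errors
    fpe = 100.0 * np.mean(rel_error[~gross])        # mean error of grossly correct estimates
    return gpe, fpe
```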

References:

  1. John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
  2. Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
  3. F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
  4. Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
  5. Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
  6. John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
  7. Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
  8. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
  9. Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
  10. Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
  11. Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
  12. Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
  13. Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
  14. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
  15. Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
  16. Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
  17. Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
  18. Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically
