100+ datasets found
  1. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1more
    Updated Oct 13, 2014
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Explore at:
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIROhttp://www.csiro.au/
    License

    https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIROhttp://www.csiro.au/
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  2. 1000 Empirical Time series

    • figshare.com
    • researchdata.edu.au
    png
    Updated May 30, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Explore at:
    pngAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Ben Fulcher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv. These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as>> TS_Init('INP_Empirical1000.mat');Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.

  3. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeremy Sheff; Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jeremy Sheff; Jeremy Sheff
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canada
    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021)), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  4. A

    ‘E-Designations: CSV file’ analyzed by Analyst-2

    • analyst-2.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘E-Designations: CSV file’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-e-designations-csv-file-9ee2/f71a6d90/?iid=011-360&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘E-Designations: CSV file’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/13ba2f51-b7cd-48fc-86d4-273a0ae3502c on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    This data set contains information and locations of (E) Designations, including CEQR Environment Requirements (Table 1) and CEQR Restrictive Declarations (Table 2), in Appendix C of the Zoning Resolution. An (E) Designation provides notice of the presence of an environmental requirement pertaining to potential hazardous materials contamination, high ambient noise levels or air emission concerns on a particular tax lot.

    All previously released versions of this data are available at BYTES of the BIG APPLE- Archive

    --- Original source retains full ownership of the source dataset ---

  5. d

    Randomized Hourly Load Data for use with Taxonomy Distribution Feeders.

    • datadiscoverystudio.org
    • data.wu.ac.at
    Updated Aug 29, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/bc873dbf6a1f44c190153d3345fbbafd/html
    Explore at:
    Dataset updated
    Aug 29, 2017
    Description

    description: This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1]. The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language. This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000. For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2. Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed. For questions about this dataset, contact andy.hoke@nrel.gov. If you find this dataset useful, please mention NREL and cite [1] in your work. References: [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders, IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 . [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, Modern Grid Initiative Distribution Taxonomy Final Report, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, Distribution power flow for smart grid technologies, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.; abstract: This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1]. The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language. This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000. For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2. Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed. For questions about this dataset, contact andy.hoke@nrel.gov. If you find this dataset useful, please mention NREL and cite [1] in your work. References: [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders, IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 . [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, Modern Grid Initiative Distribution Taxonomy Final Report, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, Distribution power flow for smart grid technologies, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.

  6. Z

    Data from: Large Landing Trajectory Data Set for Go-Around Analysis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 16, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marcel Dettling (2022). Large Landing Trajectory Data Set for Go-Around Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7148116
    Explore at:
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    Raphael Monstein
    Marcel Dettling
    Timothé Krauth
    Benoit Figuet
    Manuel Waltert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large go-around, also referred to as missed approach, data set. The data set is in support of the paper presented at the OpenSky Symposium on November the 10th.

    If you use this data for a scientific publication, please consider citing our paper.

    The data set contains landings from 176 (mostly) large airports from 44 different countries. The landings are labelled as performing a go-around (GA) or not. In total, the data set contains almost 9 million landings with more than 33000 GAs. The data was collected from OpenSky Network's historical data base for the year 2019. The published data set contains multiple files:

    go_arounds_minimal.csv.gz

    Compressed CSV containing the minimal data set. It contains a row for each landing and a minimal amount of information about the landing, and if it was a GA. The data is structured in the following way:

        Column name
        Type
        Description
    
    
    
    
        time
        date time
        UTC time of landing or first GA attempt
    
    
        icao24
        string
        Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned
    
    
        callsign
        string
        Aircraft identifier in air-ground communications
    
    
        airport
        string
        ICAO airport code where the aircraft is landing
    
    
        runway
        string
        Runway designator on which the aircraft landed
    
    
        has_ga
        string
        "True" if at least one GA was performed, otherwise "False"
    
    
        n_approaches
        integer
        Number of approaches identified for this flight
    
    
        n_rwy_approached
        integer
        Number of unique runways approached by this flight
    

    The last two columns, n_approaches and n_rwy_approached, are useful to filter out training and calibration flight. These have usually a large number of n_approaches, so an easy way to exclude them is to filter by n_approaches > 2.

    go_arounds_augmented.csv.gz

    Compressed CSV containing the augmented data set. It contains a row for each landing and additional information about the landing, and if it was a GA. The data is structured in the following way:

        Column name
        Type
        Description
    
    
    
    
        time
        date time
        UTC time of landing or first GA attempt
    
    
        icao24
        string
        Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned
    
    
        callsign
        string
        Aircraft identifier in air-ground communications
    
    
        airport
        string
        ICAO airport code where the aircraft is landing
    
    
        runway
        string
        Runway designator on which the aircraft landed
    
    
        has_ga
        string
        "True" if at least one GA was performed, otherwise "False"
    
    
        n_approaches
        integer
        Number of approaches identified for this flight
    
    
        n_rwy_approached
        integer
        Number of unique runways approached by this flight
    
    
        registration
        string
        Aircraft registration
    
    
        typecode
        string
        Aircraft ICAO typecode
    
    
        icaoaircrafttype
        string
        ICAO aircraft type
    
    
        wtc
        string
        ICAO wake turbulence category
    
    
        glide_slope_angle
        float
        Angle of the ILS glide slope in degrees
    
    
        has_intersection
    

    string

        Boolean that is true if the runway has an other runway intersecting it, otherwise false
    
    
        rwy_length
        float
        Length of the runway in kilometre
    
    
        airport_country
        string
        ISO Alpha-3 country code of the airport
    
    
        airport_region
        string
        Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania)
    
    
        operator_country
        string
        ISO Alpha-3 country code of the operator
    
    
        operator_region
        string
        Geographical region of the operator of the aircraft (either Europe, North America, South America, Asia, Africa, or Oceania)
    
    
        wind_speed_knts
        integer
        METAR, surface wind speed in knots
    
    
        wind_dir_deg
        integer
        METAR, surface wind direction in degrees
    
    
        wind_gust_knts
        integer
        METAR, surface wind gust speed in knots
    
    
        visibility_m
        float
        METAR, visibility in m
    
    
        temperature_deg
        integer
        METAR, temperature in degrees Celsius
    
    
        press_sea_level_p
        float
        METAR, sea level pressure in hPa
    
    
        press_p
        float
        METAR, QNH in hPA
    
    
        weather_intensity
        list
        METAR, list of present weather codes: qualifier - intensity
    
    
        weather_precipitation
        list
        METAR, list of present weather codes: weather phenomena - precipitation
    
    
        weather_desc
        list
        METAR, list of present weather codes: qualifier - descriptor
    
    
        weather_obscuration
        list
        METAR, list of present weather codes: weather phenomena - obscuration
    
    
        weather_other
        list
        METAR, list of present weather codes: weather phenomena - other
    

    This data set is augmented with data from various public data sources. Aircraft related data is mostly from the OpenSky Network's aircraft data base, the METAR information is from the Iowa State University, and the rest is mostly scraped from different web sites. If you need help with the METAR information, you can consult the WMO's Aerodrom Reports and Forecasts handbook.

    go_arounds_agg.csv.gz

    Compressed CSV containing the aggregated data set. It contains a row for each airport-runway, i.e. every runway at every airport for which data is available. The data is structured in the following way:

        Column name
        Type
        Description
    
    
    
    
        airport
        string
        ICAO airport code where the aircraft is landing
    
    
        runway
        string
        Runway designator on which the aircraft landed
    
    
        n_landings
        integer
        Total number of landings observed on this runway in 2019
    
    
        ga_rate
        float
        Go-around rate, per 1000 landings
    
    
        glide_slope_angle
        float
        Angle of the ILS glide slope in degrees
    
    
        has_intersection
        string
        Boolean that is true if the runway has an other runway intersecting it, otherwise false
    
    
        rwy_length
        float
        Length of the runway in kilometres
    
    
        airport_country
        string
        ISO Alpha-3 country code of the airport
    
    
        airport_region
        string
        Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania)
    

    This aggregated data set is used in the paper for the generalized linear regression model.

    Downloading the trajectories

    Users of this data set with access to OpenSky Network's Impala shell can download the historical trajectories from the historical data base with a few lines of Python code. For example, you want to get all the go-arounds of the 4th of January 2019 at London City Airport (EGLC). You can use the Traffic library for easy access to the database:

    import datetime from tqdm.auto import tqdm import pandas as pd from traffic.data import opensky from traffic.core import Traffic

    load minimum data set

    df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False) df["time"] = pd.to_datetime(df["time"])

    select London City Airport, go-arounds, and 2019-01-04

    airport = "EGLC" start = datetime.datetime(year=2019, month=1, day=4).replace( tzinfo=datetime.timezone.utc ) stop = datetime.datetime(year=2019, month=1, day=5).replace( tzinfo=datetime.timezone.utc )

    df_selection = df.query("airport==@airport & has_ga & (@start <= time <= @stop)")

    iterate over flights and pull the data from OpenSky Network

    flights = [] delta_time = pd.Timedelta(minutes=10) for _, row in tqdm(df_selection.iterrows(), total=df_selection.shape[0]): # take at most 10 minutes before and 10 minutes after the landing or go-around start_time = row["time"] - delta_time stop_time = row["time"] + delta_time

    # fetch the data from OpenSky Network
    flights.append(
      opensky.history(
        start=start_time.strftime("%Y-%m-%d %H:%M:%S"),
        stop=stop_time.strftime("%Y-%m-%d %H:%M:%S"),
        callsign=row["callsign"],
        return_flight=True,
      )
    )
    

    The flights can be converted into a Traffic object

    Traffic.from_flights(flights)

    Additional files

    Additional files are available to check the quality of the classification into GA/not GA and the selection of the landing runway. These are:

    validation_table.xlsx: This Excel sheet was manually completed during the review of the samples for each runway in the data set. It provides an estimate of the false positive and false negative rate of the go-around classification. It also provides an estimate of the runway misclassification rate when the airport has two or more parallel runways. The columns with the headers highlighted in red were filled in manually, the rest is generated automatically.

    validation_sample.zip: For each runway, 8 batches of 500 randomly selected trajectories (or as many as available, if fewer than 4000) classified as not having a GA and up to 8 batches of 10 random landings, classified as GA, are plotted. This allows the interested user to visually inspect a random sample of the landings and go-arounds easily.

  7. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    csv, png, binAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  8. c

    Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends |...

    • crawlfeeds.com
    csv, zip
    Updated Jul 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends | Download Now! [Dataset]. https://crawlfeeds.com/datasets/ebay-products-dataset
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Massive eBay Marketplace Data Collection for E-commerce Intelligence

    Unlock the power of online marketplace analytics with our comprehensive eBay products dataset. This premium collection contains 1.29 million products from eBay's global marketplace, providing extensive insights into one of the world's largest e-commerce platforms. Perfect for competitive analysis, pricing strategies, market research, and machine learning applications in e-commerce.

    Dataset Overview

    • Total Products: 1,290,000+ marketplace listings
    • Source: eBay Global Marketplace
    • Format: CSV, ZIP compressed
    • File Size: Optimized compressed format
    • Coverage: Multi-category product listings across eBay

    Complete Data Fields Included

    Product Identification

    • id: Unique eBay product identifiers
    • name: Complete product titles and names
    • url: Direct eBay listing page links
    • epid: eBay Product ID for catalog matching
    • source: Data source identification

    Product Details

    • raw_product_description: Original unprocessed product descriptions
    • product_description: Cleaned and formatted product descriptions
    • brand: Brand names and manufacturer information
    • mpn: Manufacturer Part Numbers
    • gtin13: Global Trade Item Numbers (barcodes)

    Pricing and Availability

    • price: Current listing prices
    • currency: Currency information for international listings
    • in_stock: Stock availability status
    • breadcrumbs: Category navigation paths

    Visual and Technical Data

    • images: Product image URLs and references
    • crawled_at: Data collection timestamps

    Key Use Cases

    E-commerce Market Research

    • Analyze eBay marketplace trends and patterns
    • Study product category performance
    • Monitor pricing strategies across sellers
    • Identify high-demand product categories

    Competitive Intelligence

    • Benchmark pricing against eBay marketplace
    • Analyze product positioning strategies
    • Study seller competition and market share
    • Monitor inventory levels and availability

    Price Optimization

    • Develop dynamic pricing algorithms
    • Analyze price elasticity across categories
    • Compare marketplace pricing trends
    • Optimize listing prices for maximum visibility

    Machine Learning Applications

    • Train recommendation systems
    • Develop price prediction models
    • Create product categorization algorithms
    • Build inventory forecasting systems

    Target Industries

    E-commerce Retailers

    • Marketplace Strategy: Optimize eBay selling strategies
    • Pricing Intelligence: Competitive price monitoring
    • Product Research: Identify profitable product opportunities
    • Inventory Planning: Demand forecasting and stock optimization

    Technology Companies

    • AI Training Data: Machine learning model development
    • Analytics Platforms: E-commerce intelligence tools
    • Price Comparison: Marketplace comparison services
    • Search Enhancement: Product discovery optimization

    Market Research Firms

    • Industry Reports: E-commerce marketplace analysis
    • Consumer Behavior: Online shopping pattern studies
    • Brand Monitoring: Brand performance tracking
    • Trend Analysis: Market trend identification

    Academic Research

    • E-commerce Studies: Online marketplace research
    • Business Intelligence: Retail analytics case studies
    • Data Science Projects: Large-scale dataset analysis
    • Economic Research: Digital marketplace economics

    Data Quality Features

    • Comprehensive Coverage: 1.29 million unique products
    • Rich Metadata: Complete product information included
    • Validated Data: Quality-checked and processed
    • Structured Format: Ready-to-use CSV format
    • Global Scope: International marketplace coverage
  9. c

    Mecca Australia Extracted Data in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Sep 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2024). Mecca Australia Extracted Data in CSV Format [Dataset]. https://crawlfeeds.com/datasets/mecca-australia-extracted-data-in-csv-format
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Sep 2, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    format. This dataset provides comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.

    Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.

    Features:

    • Product Information: Detailed data on various beauty products, including product names, categories, and brands.
    • Pricing Data: Up-to-date pricing details for each product, allowing for competitive analysis and pricing strategy development.
    • Product Descriptions: Comprehensive descriptions that provide insight into product features and benefits.
    • Stock Availability: Information on stock status to help track product availability and manage inventory.
    • CSV Format: Easy-to-use CSV file format for seamless integration into any data analysis or business intelligence tool.

    Applications:

    • Market Analysis: Gain insights into the beauty market trends in Australia by analyzing product categories, brands, and pricing.
    • Competitor Research: Compare product offerings and pricing strategies to understand the competitive landscape.
    • Inventory Management: Use stock availability data to optimize inventory and ensure popular items are always in stock.
    • Product Development: Leverage product descriptions to identify gaps in the market and innovate new product offerings.

    With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.

  10. Data from: Current and projected research data storage needs of Agricultural...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. https://catalog.data.gov/dataset/current-and-projected-research-data-storage-needs-of-agricultural-research-service-researc-f33da
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    The USDA Agricultural Research Service (ARS) recently established SCINet , which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to themselves collate responses from their unit before reporting in the survey. Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values. Resources in this dataset:Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdfResource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here. Resource Software Recommended: Adobe Acrobat,url: https://get.adobe.com/reader/ Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csvResource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is that same data as in the Excel spreadsheet (also provided).Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsxResource Description: MS Excel worksheet that Includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.Resource Software Recommended: Microsoft Excel,url: https://products.office.com/en-us/excel

  11. HR Dataset.csv

    • kaggle.com
    Updated Mar 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fahad Rehman (2024). HR Dataset.csv [Dataset]. https://www.kaggle.com/datasets/fahadrehman07/hr-comma-sep-csv
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Fahad Rehman
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🟡Please Upvote my dataset If you like It.✨

    This dataset contains valuable employee information over time that can be analyzed to help optimize key HR functions. Some potential use cases include:

    Attrition analysis: Identify factors correlated with attrition like department, role, salary, etc. Segment high-risk employees. Predict future attrition.

    Performance management: Analyze the relationship between metrics like ratings, and salary increments. recommend performance improvement programs.

    Workforce planning: Forecast staffing needs based on historical hiring/turnover trends. Determine optimal recruitment strategies.

    Compensation analysis: Benchmark salaries vs performance, and experience. Identify pay inequities. Inform compensation policies.

    Diversity monitoring: Assess diversity metrics like gender ratio over roles, and departments. Identify underrepresented groups.

    Succession planning: Identify high-potential candidates and critical roles. Predict internal promotions/replacements in advance.

    Given its longitudinal employee data and multiple variables, this dataset provides rich opportunities for exploration, predictive modeling, and actionable insights. With a large sample size, it can uncover subtle patterns. Cleaning, joining with other contextual data sources can yield even deeper insights. This makes it a valuable starting point for many organizational studies and evidence-based decision-making.

    .............................................................................................................................................................................................................................................

    This dataset contains information about different attributes of employees from a company. It includes 1000 employee records and 12 feature columns.

    The columns are:

    satisfaction_level: Employee satisfaction score (1-5 scale) last_evaluation: Score on last evaluation (1-5 scale) number_project: Number of projects employee worked on average_monthly_hours: Average hours worked in a month time_spend_company: Number of years spent with the company work_accident: If an employee had a workplace accident (yes/no) left: If an employee has left the company (yes/no) promotion_last_5years: Number of promotions in last 5 years Department: Department of the employee Salary: Annual salary of employee satisfaction_level: Employee satisfaction level (1-5 scale) last_evaluation: Score on last evaluation (1-5 scale)

  12. Metadata record for: The COUGHVID crowdsourcing dataset, a corpus for the...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Scientific Data Curation Team (2023). Metadata record for: The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms [Dataset]. http://doi.org/10.6084/m9.figshare.14377019.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Contents:

        1. human readable metadata summary table in CSV format
    
    
        2. machine readable metadata file in JSON format
    
  13. c

    Fox News dataset is for analyzing media trends and narratives

    • crawlfeeds.com
    csv, zip
    Updated May 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

    Key Features of the Fox News Dataset

    • Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.
    • Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.
    • Format: Provided in CSV format for seamless integration into analytical and research tools.

    Why Use This Dataset?

    This large dataset is ideal for:

    • Text Classification: Develop machine learning models to classify and categorize news content.
    • Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling.
    • Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.
    • Trend Analysis: Identify shifts in public discourse and media focus over time.

    Explore More News Datasets

    Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

    The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.

  14. s

    Data from: Data files used to study change dynamics in software systems

    • figshare.swinburne.edu.au
    pdf
    Updated Jul 22, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rajesh Vasa (2024). Data files used to study change dynamics in software systems [Dataset]. http://doi.org/10.25916/sut.26288227.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Swinburne
    Authors
    Rajesh Vasa
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? (2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).

  15. Yost Index with 90% confidence intervals (with all contributing source files...

    • figshare.com
    zip
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Francis P. Boscoe; Bian Liu; Furrina F. Lee; Li Niu; jordana lafantasie (2023). Yost Index with 90% confidence intervals (with all contributing source files - LARGE) [Dataset]. http://doi.org/10.6084/m9.figshare.16649773.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Francis P. Boscoe; Bian Liu; Furrina F. Lee; Li Niu; jordana lafantasie
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We extend our previous work with the Yost Index by adding 90% confidence intervals to the index values. These were calculated using the variance replicate estimates published in association with the American Community Survey of the United States Census Bureau.

    In the file yost-tract-2015-2019.csv, the data fields consists of 11-digit geographic ID built from FIPS codes (2 digit state, 3 digit county, 6 digit census tract); Yost index, 90% lower confidence interval; 90% upper confidence interval. Data is provided for 72,793 census tracts for which sufficient data were available. The Yost Index ranges from 1 (lowest socioeconomic position) to 100 (highest socioeconomic position).

    For those only interested in using the index as we have calculated it, the file yost-tract-2015-2019 is the only file you need. The other 368 files here are provided for anyone who wishes to replicate our results using the R program yost-conf-intervals.R. The program presumes the user is running Windows machine and that all files reside in a folder called C:/yostindex. The R program requires a number of packages, all of which are specified in lines 10-22 of the program.

    Details of this project were published in Boscoe FP, Liu B, LaFantasie J, Niu L, Lee FF. Estimating uncertainty in a socioeconomic index derived from the American Community Survey. SSM-Population Health 2022; 18: 101078. Full text

    Additional years of data following this format are planned to be added to this repository in time.

  16. Replication Package of the paper "Large Language Models for Multilingual...

    • zenodo.org
    zip
    Updated Mar 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anonymous; Anonymous (2025). Replication Package of the paper "Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality" [Dataset]. http://doi.org/10.5281/zenodo.15028641
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality

    Abstract

    Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed test and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tend to contain mixes of comments and literals written in English and the language used to formulate the prompt.

    Replication Package

    This replication package is organized into two main directories: data and scripts. The datadirectory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.

    Data

    The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:

    1. prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The original English prompt.
      • summary: The original summary.
      • code: The original code.
      • translation: The translation generated by GPT.
      • correction: The manual correction of the GPT-generated translation.
      • correction_tag: A list of tags indicating the corrections made to the translation.
      • generated_code: This column is initially empty and will contain the code generated from the translated prompt.
    2. generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:

      • files: The files with the generated code (named by the query ID).
      • report: Reports generated by static analysis tools.
      • A CSV file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.
    3. tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

    4. quantitative_analysis: Contains all the csv reports of the static analysis tools and test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in paper's tables.

    5. qualitative_analysis: Contains files used for the qualitative analysis:

      • CohenKappaagreement.csv: A file containing the subset used to compute Cohen's kappa metrics for manual analysis.
      • files: Contains all files for the qualitative analysis. Each file has the following columns:
        • id: The ID of the query in the CoderEval benchmark.
        • generated_code: The code generated by the model.
        • comments: The language used for comments.
        • identifiers: The language used for identifiers.
        • literals: The language used for literals.
        • notes: Additional notes.
    6. ablation_study: Contains files for the ablation study. Each file has the following columns:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The prompt used for code generation.
      • generated_code, comments, identifiers, and literals: Same as in the qualitative analysis. results.pdf: This file shows the table containing all the percentages of comments, identifiers and literals extracted from the csv files of the ablation study.

      Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:

    You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
    Use a Python code block to write your response.
    Comments and identifiers must be in Italian. 
    For example:
    ```python
    print("Hello World!")

    Scripts

    The scripts directory contains all the scripts used to perform all the generations and analysis. All files are properly commented. Here a brief description of each file:

    • code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.

    • computeallanalysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.

    • createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.

    • deepseek_model.py: This function sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.

    • extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.

    • flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.

    • generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.

    • generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are

  17. Z

    Annual maps of cropland abandonment, land cover, and other derived data for...

    • data.niaid.nih.gov
    • repository.soilwise-he.eu
    Updated Jul 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Radeloff, Volker C. (2022). Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5348286
    Explore at:
    Dataset updated
    Jul 19, 2022
    Dataset provided by
    Crawford, Christopher L.
    Yin, He
    Wilcove, David S.
    Radeloff, Volker C.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support:

    Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999.

    An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127).

    Note that the label "_2022_02_07" in many file names refers to the date of the primary analysis. "dts” or “dt” refer to “data.tables," large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). “Rasters” refer to “.tif” files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster).

    Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though are aggregated across sites. These four categories are as follows (using “age_dts” for our site in Shaanxi Province, China as an example):

    observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv);

    potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label “_potential” (e.g., shaanxi_potential_age_2022_02_07.csv);

    maximum age of abandonment over the course of the time series, with the label “_max” (e.g., shaanxi_max_age_2022_02_07.csv);

    recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label “_recult” (e.g., shaanxi_recult_age_2022_02_07.csv).

    This archive includes multiple .zip files, the contents of which are described below:

    age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year, also referred to as length, duration, etc.), for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (in terms of longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (where years are labeled with "y" followed by the year, e.g., "y1987"). Maps are given with a latitude and longitude coordinate reference system. Folder contains observed age, potential age (“_potential”), maximum age (“_max”), and recultivation lengths (“_recult”) for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., “max age”, in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions "cc_filter_abn_dt()," “cc_calc_max_age()," “cc_calc_potential_age(),” and “cc_calc_recult_age();” see "_util/_util_functions.R."

    age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for), for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Folder contains observed age, potential age (“_potential”), and maximum age (“_max”) rasters for all sites. Maximum age rasters include just one band (“layer”). These rasters match the corresponding .csv files contained in "age_dts.zip.”

    derived_data.zip - summary datasets created throughout this analysis, listed below.

    diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in "input_data.tables.zip" and "age_dts.zip." These files were produced using the custom function "cc_diff_dt()" (much like the base R function "diff()"), contained within the custom function "cc_filter_abn_dt()" (see "_util/_util_functions.R"). Folder contains diff files for observed abandonment, potential abandonment (“_potential”), and recultivation lengths (“_recult”) for all sites.

    input_dts.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like “age_dts,” these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (in terms of longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., "y1987"). Note that these maps were recoded from Yin et al. 2020 so that land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a “clean” version produced by applying 5- and 8-year temporal filters to the raw input (see custom function “cc_temporal_filter_lc(),” in “_util/_util_functions.R” and “1_prep_r_to_dt.R”). These files correspond to those in "input_rasters.zip," and serve as the primary inputs for the analysis.

    input_rasters.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as ".tif" files, where each band corresponds one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a “clean” version that has been processed with 5- and 8-year temporal filters (see above). These files match those in "input_dts.zip."

    length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom function "cc_filter_abn_dt()" and “cc_extract_length();” see "_util/_util_functions.R."

    derived_data.zip contains the following files:

    "site_df.csv" - a simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites in how land cover classes were coded; see below).

    Primary derived datasets for both observed abandonment (“area_dat”) and potential abandonment (“potential_area_dat”).

    area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 years). Produced using custom functions "cc_calc_area_per_lc_abn()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R."

    persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given "cohort" of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called "year_abn") in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, or the area of each cohort when it was first classified as abandoned at year 5 ("initial_area_abn"). The "age" is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and "time" is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using custom functions "cc_calc_persistence()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R." This serves as the main input for our linear models of recultivation (“decay”) trajectories.

    turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using custom functions "cc_calc_abn_diff()" via "cc_summarize_abn_dts()" (see "_util/_util_functions.R"), implemented in "cluster/2_analyze_abn.R." This file is only produced for observed abandonment.

    Area summary files (for observed abandonment only)

    area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) "site," 2) "total_site_area_ha_2017" - the total site area (ha) in 2017, 3) "cropland_area_1987" - the area in cropland in 1987 (ha), 4) "area_abn_ha_2017" - the area of cropland abandoned as of 2017 (ha), 5) "area_ever_abn_ha" - the total area of those pixels that were abandoned at least once during the time series (corresponding to the area of potential abandonment, as of 2017), 6) "total_crop_extent_ha" - the total area of those pixels that were classified as cropland at least once during the time series, 7)

  18. o

    AI Question Answering Data

    • opendatabay.com
    .undefined
    Updated Jul 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Datasimple (2025). AI Question Answering Data [Dataset]. https://www.opendatabay.com/data/ai-ml/d3c37fed-f830-444b-a988-c893d3396fd7
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer valuable insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a valuable resource for building and assessing question-answering systems. It is available free of charge.

    Columns

    • instruction: Contains the specific instructions given to a model to generate a response.
    • responses: Includes the responses generated by the model based on the given instructions.
    • next_response: Provides the subsequent response from the model, following a previous response, which facilitates a conversational interaction.
    • answer: Lists the correct answer for each question presented in the instruction, acting as a reference for assessing the model's accuracy.
    • is_human_response: A boolean column that indicates whether a particular response was created by a human or by a machine learning model, helping to differentiate between the two. Out of nearly 19,300 entries, 254 are human-generated responses, while 18,974 were generated by models.

    Distribution

    The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates are not included within this dataset description, focusing solely on providing accurate and informative details about its content and purpose. Specific numbers for rows or records are not detailed in the available information.

    Usage

    This dataset is ideal for a variety of applications and use cases: * Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions. * Machine Learning Model Creation: Develop machine learning models specifically for question-answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels. * Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file. * Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts. * Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training. * Language Understanding: Train models to understand language and generate responses based on instructions and previous responses. * Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students. * Information Retrieval Systems: Create systems that help users find specific answers from large datasets. * Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries. * Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios. * Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering. * Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance. * NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.

    Coverage

    The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.

    License

    CC0

    Who Can Use It

    This dataset is highly suitable for: * Researchers and Practitioners: To gain insights into question answering tasks using AI models. * Developers: To train models, create chatbots, and build conversational agents. * Students: For developing educational materials and enhancing their learning experience through interactive tools. * Individuals and teams working on Natural Language Processing (NLP) projects. * Those creating information retrieval systems or customer support solutions. * Experts in natural language generation (NLG) and automatic summarisation systems. * Anyone involved in the evaluation of dialogue systems and machine learning model training.

    Dataset Name Suggestions

    • AI Question Answering Data
    • Conversational AI Training Data
    • NLP Question-Answering Dataset
    • Model Evaluation QA Data
    • Dialogue Response Dataset

    Attributes

    Original Data Source: Question-Answering Training and Testing Data

  19. m

    DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

    • data.mendeley.com
    • narcis.nl
    Updated Mar 12, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fabian Constante (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.3
    Explore at:
    Dataset updated
    Mar 12, 2019
    Authors
    Fabian Constante
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A DataSet of Supply Chains used by the company DataCo Global was used for the analysis. Dataset of Supply Chain , which allows the use of Machine Learning Algorithms and R Software. Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.It also allows the correlation of Structured Data with Unstructured Data for knowledge generation.

    Type Data : Structured Data : DataCoSupplyChainDataset.csv Unstructured Data : tokenized_access_logs.csv (Clickstream)

    Types of Products : Clothing , Sports , and Electronic Supplies

    Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.

  20. f

    JSON files and CSV files

    • figshare.com
    txt
    Updated May 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marie Carstensen (2025). JSON files and CSV files [Dataset]. http://doi.org/10.6084/m9.figshare.29098967.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 26, 2025
    Dataset provided by
    figshare
    Authors
    Marie Carstensen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    JSON files from each site and depth (site_depth). CSV files exported from Taglab data were merged into one large file for further analysis.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Organization logo

CSV file used in statistical analyses

Explore at:
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttp://www.csiro.au/
License

https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/

Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttp://www.csiro.au/
Description

A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

Search
Clear search
Close search
Google apps
Main menu