100+ datasets found
  1. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

This dataset, labeled Reddit Technology Data, provides insight into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains the titles of discussions, the scores contributed by Reddit users, the unique IDs attributed to each discussion, the URLs associated with those discussions (if any), the comment count for each thread, and timestamps of when the conversations were started. The data is valuable for anyone wanting to stay up to date with new developments in technology or to keep abreast of industry trends; in short, it is a repository that helps people make sense of what is happening in the technology world at large.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨

    How to use the dataset

The dataset includes six columns: title, score, url (a link to the discussion page on Reddit), comment count, creation timestamp (when the post was made), and body (the actual text written for that post or discussion). Analyzing each column separately shows what kind of information users engage with across different aspects of technology-related discussions. You can also form hypotheses about correlations between factors: for example, which types of posts do people comment on or react to the most, and do high scores always come with long comment threads? Exploring the data this way surfaces patterns hidden inside a social platform like Reddit, which holds a large amount of rich information about users' interests in technology topics, and similar analyses of reactions on other public forums (such as Stack Overflow or Facebook posts) can add further context. These small investigations yield useful insights for research and can point to potential business opportunities if the data is monitored over time.

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: technology.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion. (String) |
| score | The score of the discussion as measured by Reddit contributors. (Integer) |
| url | The website URL associated with the discussion. (String) |
| comms_num | The number of comments associated with the discussion. (Integer) |
| created | The date and time the discussion was created. (DateTime) |
| body | The body content of the discussion. (String) |
| timestamp | The timestamp of the discussion. (Integer) |
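
As a quick, hedged illustration of the column-level analysis described above (assuming a local copy of technology.csv with the columns listed in the table), one could start with something like:

import pandas as pd

# Load the file described in the column listing above (local path assumed)
df = pd.read_csv('technology.csv')

# Does a high score tend to come with a high comment count?
print(df[['score', 'comms_num']].corr())

# Ten most-commented discussions
top = df.sort_values('comms_num', ascending=False)
print(top[['title', 'score', 'comms_num']].head(10))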

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  2. Complete In-Depth Dataset for League of Legends

    • kaggle.com
    zip
    Updated Jan 9, 2021
    Cite
    Seouk Jun Kim (2021). Complete In-Depth Dataset for League of Legends [Dataset]. https://www.kaggle.com/datasets/kdanielive/lol-partchallenger-1087
    Explore at:
zip (51507850 bytes)
    Dataset updated
    Jan 9, 2021
    Authors
    Seouk Jun Kim
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

League of Legends is a popular online game played globally by millions of players each month. In the past few years, the League of Legends e-sports industry has shown phenomenal growth. Just recently, in 2020, the World Championship finals drew 3.8 million peak viewers! While the e-sports industry still lags behind traditional sports in terms of popularity and viewership, it has shown exponential growth in regions with fast-growing economies, such as Vietnam and China, making it a prime target for sponsorship by foreign companies looking to spread brand awareness in those regions.

While the e-sports data industry is also growing gradually, there is not much published analysis of individual games available publicly. This may be because the games change quickly compared to traditional sports: rules and game stats are frequently and arbitrarily changed by the developers. Nevertheless, it is an interesting field for fun research, hence the many pet projects and graduate-level papers dedicated to it.

All existing League of Legends games (excluding custom games, and including games from competitions) are made available through Riot's API. However, having to request and parse the data for every relevant game is quite tedious; this dataset intends to save you that work. To make things (hopefully) easier, I parsed the JSON files returned by the Riot API into CSV files, with each row corresponding to one game.

    Components

    This dataset consists of three parts: root games, root2tail, and tail games.

    I found that quite often when trying to predict the outcome of a match prior to its play, the historical matches of a player prior to that game count as an important factor (Hall, 2017). For such purpose, root games contains 1087 games from which tail games branches out.

Tail games contains the historical matches of each player for every game in root games. Root2tail maps each player's account ID in root games, together with the champion ID that player controlled, to a list of matches that can be found in tail games.

To simplify the explanation, if you want to access the historical matches of a player in the root games file: 1. Get the player's account ID and the game ID. 2. Load the root2tail file. 3. Query for the matching row on account ID and game ID. 4. The corresponding row contains a list of game IDs that can be queried in the tail_games files.

Note that root2tail documents up to the 5 most recent matches played within the 5 weeks prior to the game creation date of the corresponding "root game". It also only documents the most recent games the player played with the same champion he/she played in the "root game". An empty list means the player did not play a single match with that champion within the past 5 weeks.
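
To make the lookup steps above concrete, here is a minimal pandas sketch; the file and column names (root_games.csv, root2tail.csv, tail_games.csv, accountId, gameId, tailGameIds) are illustrative assumptions, not necessarily the dataset's actual names:

import pandas as pd

# Illustrative file names; the actual CSV names in this dataset may differ.
root_games = pd.read_csv('root_games.csv')
root2tail = pd.read_csv('root2tail.csv')
tail_games = pd.read_csv('tail_games.csv')

# Step 1: take a player's account ID and the game ID from a root game.
account_id = root_games.loc[0, 'accountId']
game_id = root_games.loc[0, 'gameId']

# Steps 2-3: find the matching root2tail row on account ID and game ID.
row = root2tail[(root2tail['accountId'] == account_id) & (root2tail['gameId'] == game_id)]

# Step 4: the row holds a list of game IDs to look up in the tail games file.
history_ids = row['tailGameIds'].iloc[0]  # assumed to already be a list of game IDs
history = tail_games[tail_games['gameId'].isin(history_ids)]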

    Content

    How was this data collected?

On December 5th, 2020, I fetched the list of current Challenger-tier players and then recursively gathered their historical matches to build root games, so this is the data collection date.

    What do the rows and columns of the csv data represent?

    Root2tail is self-explanatory. As for the other files, each row represents a single game. The columns are quite confusing, however, as it is a flattened version of a JSON file with nested lists of dictionaries.

    I tried to think of the simplest way to make the columns comprehensible, but looking at the original JSON file is most likely the simplest way to understand the structure. Use tools like https://jsonformatter.curiousconcept.com/ to inspect the dummy_league_match.json file.

A very simple explanation: participant.stats.* and participant.timeline.* contain pretty much all match-related statistics for a player during the game.
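
If the flattened column names are hard to follow, a quick way to explore the original structure in Python (a sketch, assuming the dummy JSON file ships with the dataset as described above) is:

import json
import pandas as pd

# Load the sample match JSON mentioned above
with open('dummy_league_match.json') as f:
    match = json.load(f)

# Flatten nested dictionaries into dotted column names (lists are kept as-is),
# which roughly mirrors how the CSV columns were produced.
flat = pd.json_normalize(match, sep='.')
print(flat.columns.tolist()[:30])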

    Also, note that the "accountId" fields use encrypted account IDs which are specific to my API key. If you want to do additional research using player account IDs, you should fetch the match file first and get your own list of player account IDs.

    Acknowledgements

    The following are great resources I got a lot of help from: 1. https://riot-watcher.readthedocs.io/en/latest/ 2. https://riot-api-libraries.readthedocs.io/en/latest/

    These two actually explain everything you need to get started on your own project with Riot API.

    The following are links to related projects that could maybe help you get ideas!

    1. Kim, Seouk Jun, https://towardsdatascience.com/discussing-the-champion-specific-player-win-rate-factor-in-league-of-legends-match-prediction-3d83d7e50a94 (2020)
    2. Huang, Thomas, Kim, David, and Leung, Gregory, https://thomasythuang.github.io/League-Predictor/ (2015)
    3. Jiang, Jinhang, https://towardsdatascience.com/lol-match-prediction-using-early-laning-phase-data-machine-learning-4...
  3. BIP! DB: A Dataset of Impact Measures for Research Products

    • zenodo.org
    application/gzip
    Updated Mar 17, 2024
    + more versions
    Cite
Thanasis Vergoulis; Ilias Kanellos; Claudio Atzori; Andrea Mannocci; Serafeim Chatzopoulos; Sandro La Bruzzo; Natalia Manola; Paolo Manghi (2024). BIP! DB: A Dataset of Impact Measures for Research Products [Dataset]. http://doi.org/10.5281/zenodo.10804822
    Explore at:
application/gzip
    Dataset updated
    Mar 17, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Thanasis Vergoulis; Ilias Kanellos; Claudio Atzori; Andrea Mannocci; Serafeim Chatzopoulos; Sandro La Bruzzo; Natalia Manola; Paolo Manghi
    License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they best capture):

    Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)

    Citation Count: The total number of citations of the product, the most well-known influence indicator.

    PageRank score: An influence indicator based on the PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).

    Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)

    RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).

    AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.

    Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)

Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and the time window depends on the publication date of the product, i.e., only citations received within the first 3 years after each product's publication are counted.
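
As a rough, illustrative sketch (not the project's actual pipeline), the two count-based indicators above can be computed from a citation edge list with assumed column names as follows:

import pandas as pd

# Assumed toy edge list: one row per citation, with the citing work's year.
cites = pd.DataFrame({
    'cited': ['X', 'X', 'X', 'Y'],
    'citing_year': [2015, 2016, 2020, 2016],
})
pub_year = pd.Series({'X': 2014, 'Y': 2015})  # publication year of each cited product

# Citation Count: total number of citations per product
cc = cites.groupby('cited').size()

# Incubation (3-year) Citation Count: citations within 3 years of publication
mask = cites['citing_year'] <= cites['cited'].map(pub_year) + 3
cc3 = cites[mask].groupby('cited').size()
print(cc)
print(cc3)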

    More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).

    From version 5.1 onward, the impact indicators are calculated in two levels:

    • The PID level (assuming that each PID corresponds to a distinct research product).
    • The OpenAIRE-id level (leveraging PID synonyms based on OpenAIRE's deduplication algorithm [4] - each distinct article has its own OpenAIRE id).

    Previous versions of the dataset only provided the scores at the PID level.

    From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).

Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
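
A hedged sketch of how such percentile-based classes could be assigned (toy scores; thresholds taken from the class definitions above, not the project's code):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.pareto(2.0, size=100_000))  # toy, heavy-tailed impact scores

pct = scores.rank(pct=True)  # percentile rank of each score
classes = pd.cut(
    pct,
    bins=[0, 0.90, 0.99, 0.999, 0.9999, 1.0],
    labels=['C5', 'C4', 'C3', 'C2', 'C1'],  # bottom 90% ... top 0.01%
)
print(classes.value_counts())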

Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (in error) to multiple deduplicated records in the OpenAIRE Graph.

    For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier

    From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. We provide finally the "topic_based_impact_classes.txt" file where each line follows the format "identifier

The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI dataset, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v.10 leverage the filtering rules described here to remove from the dataset PIDs with problematic metadata.

    References:

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.

    [2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380

    [3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)

    [4] P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).

    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)

    [6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MA) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839

    [7] K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045

Find our Academic Search Engine built on top of these data here. Further note that we also provide all calculated scores through BIP! Finder's API.

    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.

    More details about BIP! DB can be found in our relevant peer-reviewed publication:

    Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460

    We kindly request that any published research that makes use of BIP! DB cite the above article.

  4. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

from datasets import load_dataset
import polars as pl

# This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
RFSD = load_dataset('irlspbru/RFSD')

# Alternatively, this will download ~540MB with all financial statements for 2023
# to a Polars DataFrame (requires about 8GB of RAM)
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
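
If a random subset is needed, one hedged option (assuming the 2023 frame from above fits in memory) is to sample after loading rather than streaming the head:

RFSD_2023_sample = RFSD_2023.sample(n=100_000, seed=42)  # illustrative sample size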

    Local File Import

Importing in Python requires the pyarrow package to be installed.

import pyarrow.dataset as ds
import polars as pl

# Read RFSD metadata from local file
RFSD = ds.dataset("local/path/to/RFSD")

# Use RFSD.schema to glimpse the data structure and columns' classes
print(RFSD.schema)

# Load full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only 2019 data into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only revenue for firms in 2019, identified by taxpayer id
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give suggested descriptive names to variables
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

Importing in R requires the arrow package to be installed.

library(arrow)
library(data.table)

# Read RFSD metadata from local file
RFSD <- open_dataset("local/path/to/RFSD")

# Use schema() to glimpse into the data structure and column classes
schema(RFSD)

# Load full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only 2019 data into memory
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only revenue for firms in 2019, identified by taxpayer id
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give suggested descriptive names to variables
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

We provide financials for Russian firms for 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting corrections. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
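
For example, with polars' lazy API (a sketch reusing the local path layout from the import examples above), only the requested partition and columns are read:

import polars as pl

# Read just the 2023 partition and two columns, leaving everything else on disk
revenue_2023 = (
    pl.scan_parquet('local/path/to/RFSD/year=2023/*.parquet')
    .select(['inn', 'line_2110'])
    .collect()
)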

    Version and Update Policy

    Version (SemVer): 1.0.0.

We intend to update the RFSD annually as the data becomes available, in other words when most firms have filed their statements with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  5. Record High Temperatures for US Cities

    • kaggle.com
    zip
    Updated Jan 18, 2023
    Cite
    The Devastator (2023). Record High Temperatures for US Cities [Dataset]. https://www.kaggle.com/datasets/thedevastator/record-high-temperatures-for-us-cities-in-2015
    Explore at:
zip (9955 bytes)
    Dataset updated
    Jan 18, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    Record High Temperatures for US Cities

    Clearly Defined Monthly Data

    By Gary Hoover [source]

    About this dataset

    This dataset contains all the record-breaking temperatures for your favorite US cities in 2015. With this information, you can prepare for any unexpected weather that may come your way in the future, or just revel in the beauty of these high heat spells from days past! With record highs spanning from January to December, stay warm (or cool) with these handy historical temperature data points

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨

    How to use the dataset

This dataset contains the record high temperatures for various US cities during the year 2015. It includes a column for each individual month, along with a column for the record high over the entire year. The data is sourced from www.weatherbase.com and can be used to analyze which cities experienced hot summers, or to compare temperature variations between different regions.

Here are some useful tips on how to work with this dataset:
    - Analyze individual monthly temperatures: compare record highs across months and locations to identify which areas experienced particularly hot summers or cold winters.
    - Compare annual versus monthly data: contrast the annual record high against the monthly records to understand temperature patterns at a given location across all four seasons, or to see how different regions vary within a single year and across given months.
    - Heatmap analysis: plot the temperature data as an interactive heatmap to pinpoint regions with unusual weather conditions or higher-than-average warmth compared with cooler areas of similar size.
    - Statistical modelling: model the relationships between independent variables (temperature variation by month, region/city, and more) and dependent variables (e.g., tourism volumes) using regression techniques such as linear models (OLS), ARIMA models, nonlinear transformations, and other methods in statistical software such as Stata or the R programming language.
    - Look into climate trends over longer periods: where possible, extend the analysis beyond 2018 by expanding on the monthly station observations already present within the study timeframe, taking advantage of digitally available historical temperature readings rather than relying only on printed reports.

    With these helpful tips, you can get started analyzing record high temperatures for US cities during 2015 using our 'Record High Temperatures for US Cities' dataset!

    Research Ideas

    • Create a heat map chart of US cities representing the highest temperature on record for each city from 2015.
    • Analyze trends in monthly high temperatures in order to predict future climate shifts and weather patterns across different US cities.
    • Track and compare monthly high temperature records for all US cities to identify regional hot spots with higher than average records and potential implications for agriculture and resource management planning

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    Unknown License - Please check the dataset description for more information.

    Columns

File: Highest temperature on record through 2015 by US City.csv

| Column name | Description |
|:------------|:------------|
| CITY | Name of the city. (String) |
| JAN | Record high temperature for the month of January. (Integer) |
| FEB | Record high temperature for the month of February. (Integer) |
| MAR | Record high temperature for the month of March. (Integer) |
| APR | Record high temperature for the month of April. (Integer) |
| MAY | Record high temperature for the month of May. (Integer) |
| JUN | Record high temperature for the month of June. (Integer) |
| JUL | Record high temperature for the month of July. (Integer) |
| AUG | Record high temperature for the month of August. (Integer) |
| SEP | Record high temperature for the month of September. (Integer) |
| OCT | Record high temperature for the month of October. (Integer) |
| ... | ... |
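
A minimal pandas sketch of working with these columns (the NOV and DEC columns are assumed to exist beyond the truncated listing above):

import pandas as pd

df = pd.read_csv('Highest temperature on record through 2015 by US City.csv')

months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']  # NOV/DEC assumed present
df['hottest_month'] = df[months].idxmax(axis=1)
df['hottest_temp'] = df[months].max(axis=1)
print(df.sort_values('hottest_temp', ascending=False)[['CITY', 'hottest_month', 'hottest_temp']].head())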

  6. [whales historical id's] - Confirmed right whale identifications in Cape Cod...

    • erddap.bco-dmo.org
    Updated Mar 14, 2019
    + more versions
    Cite
    BCO-DMO (2019). [whales historical id's] - Confirmed right whale identifications in Cape Cod Bay and adjacent waters and sighting histories, from the R/V Shearwater NEC-MB2002-1 (1998-2002), and historical records from 1980 (NEC-CoopRes project) (Northeast Consortium: Cooperative Research) [Dataset]. https://erddap.bco-dmo.org/erddap/info/bcodmo_dataset_2987/index.html
    Explore at:
    Dataset updated
    Mar 14, 2019
    Dataset provided by
    Biological and Chemical Oceanographic Data Management Office (BCO-DMO)
    Authors
    BCO-DMO
    License

https://www.bco-dmo.org/dataset/2987/license

    Area covered
    Cape Cod, Cape Cod Bay
    Variables measured
    sex, year, region, whale_id
    Description

Confirmed right whale identifications in Cape Cod Bay and adjacent waters and sighting histories, from the R/V Shearwater NEC-MB2002-1 (1998-2002), and historical records from 1980 (Northeast Consortium Cooperative Research project).

    access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv

    acquisition_description=Photographic Methods
    i) Identification Photographs:
    During aerial and shipboard surveys, photographs were taken on Kodak Kodachrome 200ASA color slide film, using hand-held 35-mm cameras equipped with 300-mm telephoto lenses and motor drives. From the air, photographers attempted to obtain good perpendicular photographs of the entire rostral callosity pattern and back of every right whale encountered as well as any other scars or markings. From the boat, photographers attempted to collect good oblique photographs of both sides of the head and chin, the body and the flukes. The data recorder on both platforms was responsible for keeping a written record of the roll and frame numbers shot by each photographer in the daily log.

    ii) Photo-analysis and Matching:
    Photographs of right whale callosity patterns are used as a basis for identification and cataloging of individuals, following methods developed by Payne et al (1983) and Kraus et al (1986). The cataloging of individually identified animals is based on using high quality photographs of distinctive callosity patterns (raised patches of roughened skin on the top and sides of the head), ventral pigmentation, lip ridges, and scars (Kraus et al 1986). New England Aquarium (NEAq) has curated the catalogue since 1980 and to the best of their knowledge, all photographs of right whales taken in the North Atlantic since 1935 have been included in NEAq's files. This catalogue allows scientists to enumerate the population, and, from resightings of known individuals, to monitor the animals' reproductive status, births, deaths, scarring, distribution and migrations. Since 1980, a total of 26,275 sightings of 436 individual right whales have been archived, of which 327 are thought to be alive, as of December 2001 (A. Knowlton, NEAq, pers. comm.)

    The matching process consists of separating photographs of right whales into individuals and inter-matching between days within the season. To match different sightings of the same whale, composite drawings and photographs of the callosity patterns of individual right whales are compared to a limited subset of the catalogue that includes animals with a similar appearance. For whales that look alike in the first sort, the original photographs of all probable matches are examined for callosity similarities and supplementary features, including scars, pigmentation, lip crenulations, and morphometric ratios. A match between different sightings is considered positive when the callosity pattern and at least one other feature can be independently matched by at least two experienced researchers (Kraus et al 1986). Exceptions to this multiple identifying feature requirement include whales that have unusual callosity patterns, large scars or birthmarks, or deformities so unique that matches from clear photographs can be based on only one feature. Preliminary photo-analysis and inter-matching was carried out at CCS, with matches confirmed using original photographs cataloged and archived at NEAq.

    iii) Photographic Data Archiving
Upon completion of the matching process, all original slides were returned to CCS and incorporated into the CCS catalogue of identified right whales to update existing files, using the same numbering system as NEAq, in archival quality slide sheets. NEAq archives copies of photographs representing each sighting. Copies of photographs of individuals that are better than existing records, and photographs of newly identified whales, will be included in the NEAq master files as "type specimens" for future reference. The master files are maintained in fireproof safes at NEAq. All catalogue files are available for inspection and on-site use by contributors and collaborators.

    awards_0_award_nid=55048
    awards_0_award_number=unknown NEC-CoopRes NOAA
    awards_0_funder_name=National Oceanic and Atmospheric Administration
    awards_0_funding_acronym=NOAA
    awards_0_funding_source_nid=352
    cdm_data_type=Other
    comment=Whale sighting history P.I. Moira Brown Confirmed right whale identifications in Cape Cod Bay and adjacent waters 1998-2002 and sighting histories. (report appendix I)
    Conventions=COARDS, CF-1.6, ACDD-1.3
    data_source=extract_data_as_tsv version 2.3 19 Dec 2019
    defaultDataQuery=&time<now
    doi=10.1575/1912/bco-dmo.2987.1
    infoUrl=https://www.bco-dmo.org/dataset/2987
    institution=BCO-DMO
    instruments_0_acronym=camera
    instruments_0_dataset_instrument_description=35mm camera
    instruments_0_dataset_instrument_nid=4728
    instruments_0_description=All types of photographic equipment including stills, video, film and digital systems.
    instruments_0_instrument_external_identifier=https://vocab.nerc.ac.uk/collection/L05/current/311/
    instruments_0_instrument_name=Camera
    instruments_0_instrument_nid=520
    instruments_0_supplied_name=Camera
    metadata_source=https://www.bco-dmo.org/api/dataset/2987
    param_mapping={'2987': {}}
    parameter_source=https://www.bco-dmo.org/mapserver/dataset/2987/parameters
    people_0_affiliation=Massachusetts Division of Marine Fisheries
    people_0_person_name=Dr Daniel McKiernan
    people_0_person_nid=51014
    people_0_role=Principal Investigator
    people_0_role_type=originator
    people_1_affiliation=Provincetown Center for Coastal Studies
    people_1_affiliation_acronym=PCCS
    people_1_person_name=Dr Moira Brown
    people_1_person_nid=51013
    people_1_role=Co-Principal Investigator
    people_1_role_type=originator
    people_2_affiliation=Provincetown Center for Coastal Studies
    people_2_affiliation_acronym=PCCS
    people_2_person_name=Dr Charles Mayo
    people_2_person_nid=51015
    people_2_role=Co-Principal Investigator
    people_2_role_type=originator
    people_3_affiliation=Woods Hole Oceanographic Institution
    people_3_affiliation_acronym=WHOI BCO-DMO
    people_3_person_name=Nancy Copley
    people_3_person_nid=50396
    people_3_role=BCO-DMO Data Manager
    people_3_role_type=related
    project=NEC-CoopRes
    projects_0_acronym=NEC-CoopRes
    projects_0_description=The Northeast Consortium encourages and funds cooperative research and monitoring projects in the Gulf of Maine and Georges Bank that have effective, equal partnerships among fishermen, scientists, educators, and marine resource managers. The Northeast Consortium seeks to fund projects that will be conducted in a responsible manner. Cooperative research projects are designed to minimize any negative impacts to ecosystems or marine organisms, and be consistent with accepted ethical research practices, including the use of animals and human subjects in research, scrutiny of research protocols by an institutional board of review, etc.
    projects_0_geolocation=Georges Bank, Gulf of Maine
    projects_0_name=Northeast Consortium: Cooperative Research
    projects_0_project_nid=2045
    projects_0_project_website=http://northeastconsortium.org/
    projects_0_start_date=1999-01
    sourceUrl=(local files)
    standard_name_vocabulary=CF Standard Name Table v55
    version=1
    xml_source=osprey2erddap.update_xml() v1.3
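
The tabular data can be pulled directly from ERDDAP; the URL below follows ERDDAP's standard tabledap pattern inferred from the info URL above, so verify it against the dataset's data-access page before relying on it:

import pandas as pd

# ERDDAP CSV output: first row is column names, second row is units
url = 'https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_2987.csv'
whales = pd.read_csv(url, skiprows=[1])
print(whales[['whale_id', 'year', 'region', 'sex']].head())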

  7. Public Utility Data Liberation Project (PUDL) Data Release

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 14, 2025
    Cite
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella (2025). Public Utility Data Liberation Project (PUDL) Data Release [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3653158
    Explore at:
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Catalyst Cooperative
    Authors
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PUDL v2025.2.0 Data Release

This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.

    One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
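
A hedged sketch of reading one of the new Parquet outputs straight from the public S3 bucket listed under the release resources below; the exact object layout (one .parquet file per table under the version prefix) is an assumption, so check the Data Access docs:

import pandas as pd

# Anonymous access to the public bucket is assumed; requires the s3fs package.
table = pd.read_parquet(
    's3://pudl.catalyst.coop/v2025.2.0/out_ferc1_yearly_detailed_income_statements.parquet',
    storage_options={'anon': True},
)
print(table.head())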

    Some potentially breaking changes to be aware of:

    In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.

    We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.

    Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.

    New Data

    EIA 176

    Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.

    Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.

    EIA 860

    Added EIA 860 Multifuel table. See #3438 and #3946.

    FERC 1

    Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:

    out_ferc1_yearly_detailed_income_statements

    out_ferc1_yearly_detailed_balance_sheet_assets

    out_ferc1_yearly_detailed_balance_sheet_liabilities

    SEC Form 10-K Parent-Subsidiary Ownership

We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.

    See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:

    out_sec10k_parents_and_subsidiaries

    core_sec10k_quarterly_filings

    core_sec10k_quarterly_exhibit_21_company_ownership

    core_sec10k_quarterly_company_information

    Expanded Data Coverage

    EPA CEMS

    Added 2024 Q4 of CEMS data. See #4041 and #4052.

    EPA CAMD EIA Crosswalk

    In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.

    The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.

    EIA 860M

    Added EIA 860m through December 2024. See #4038 and #4047.

    EIA 923

    Added EIA 923 monthly data through September 2024. See #4038 and #4047.

    EIA Bulk Electricity Data

    Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.

    EIA 930

    Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.

    Bug Fixes

    Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.

Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.

    Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.

    Quality of Life Improvements

    We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.

    Other PUDL v2025.2.0 Resources

    PUDL v2025.2.0 Data Dictionary

    PUDL v2025.2.0 Documentation

    PUDL in the AWS Open Data Registry

    PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/

    PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/

    Zenodo archive of the PUDL GitHub repo for this release

    PUDL v2025.2.0 release on GitHub

    PUDL v2025.2.0 package in the Python Package Index (PyPI)

    Contact Us

    If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:

    Follow us on GitHub

    Use the PUDL Github issue tracker to let us know about any bugs or data issues you encounter

    GitHub Discussions is where we provide user support.

    Watch our GitHub Project to see what we're working on.

    Email us at hello@catalyst.coop for private communications.

    On Mastodon: @CatalystCoop@mastodon.energy

    On BlueSky: @catalyst.coop

    On Twitter: @CatalystCoop

    Connect with us on LinkedIn

    Play with our data and notebooks on Kaggle

    Combine our data with ML models on HuggingFace

    Learn more about us on our website: https://catalyst.coop

    Subscribe to our announcements list for email updates.

  8. Fugro Cruise C16185 Line 1040, 75 kHz VMADCP

    • gcoos5.geos.tamu.edu
    • data.ioos.us
    • +2 more
    Updated Sep 21, 2017
    Cite
    Rosemary Smith (2017). Fugro Cruise C16185 Line 1040, 75 kHz VMADCP [Dataset]. https://gcoos5.geos.tamu.edu/erddap/info/C16185_075_Line1040_0/index.html
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2017
    Dataset provided by
    Gulf of Mexico Coastal Ocean Observing System (GCOOS)
    Authors
    Rosemary Smith
    Time period covered
    May 21, 2006
    Area covered
    Variables measured
    u, v, pg, amp, time, depth, pflag, uship, vship, heading, and 5 more
    Description

Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007.

    _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0,
    acknowledgement=Data collection funded by various oil industry operators
    cdm_data_type=TrajectoryProfile
    cdm_profile_variables=time
    cdm_trajectory_variables=trajectory
    CODAS_processing_note=

    CODAS processing note:

    Overview

    The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.

    Post-processing

    CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.

    Full CODAS processing

    Whenever single-ping data have been recorded, full CODAS processing provides the best end product.

    Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.

Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".

    note (time):

    Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.

    note (heading):

    CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:

    X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
    G  = Reliable heading (gyrocompass)
    A  = Accurate heading
    dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)

    Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.

    theta = X + (G - dh) = X + G - dh

    Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply, for example if mean and median of the phase are all 0.5 (then R=0.5). Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".

    new_dh = dh + R

    Therefore the total angle used in rotation is

    new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)

    The new estimate of the transducer angle is: X - R ANCIL2_watrk_hd_misalign contains: dh + R
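
    As a minimal illustration of the rotation above (a sketch, not code from the pycurrents package; the function name and sample values are made up), the same arithmetic in Python:

```
import numpy as np

def rotate_to_earth(u, v, X, G, dh):
    """Rotate measured (u, v) into earth coordinates.

    X  -- transducer angle (CONFIG1_heading_bias), degrees, positive clockwise
    G  -- reliable heading (gyrocompass), degrees
    dh -- time-dependent heading correction G - A, degrees
    """
    theta = np.deg2rad(X + G - dh)           # total rotation angle
    w = (u + 1j * v) * np.exp(1j * theta)    # (u + i*v) * exp(i*theta)
    return w.real, w.imag

# Applying a residual calibration R amounts to replacing dh with dh + R,
# i.e. the effective transducer angle becomes X - R.
u_earth, v_earth = rotate_to_earth(0.20, -0.10, X=45.0, G=112.3, dh=0.4)
```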

    ====================================================

    Profile flags

    Profile editing flags are provided for each depth cell:

    binary value | decimal value | below bottom | Percent Good | bin
    -------------+---------------+--------------+--------------+-----
    000          | 0             |              |              |
    001          | 1             |              |              | bad
    010          | 2             |              | bad          |
    011          | 3             |              | bad          | bad
    100          | 4             | bad          |              |
    101          | 5             | bad          |              | bad
    110          | 6             | bad          | bad          |
    111          | 7             | bad          | bad          | bad

    CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.

    =============  =================================================================
    time           Time at the end of the ensemble, days from start of year.
    lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
    u, v           Ocean zonal and meridional velocity component profiles.
    uship, vship   Zonal and meridional velocity components of the ship.
    heading        Mean ship heading during the ensemble.
    depth          Bin centers in nominal meters (no sound speed profile correction).
    tr_temp        ADCP transducer temperature.
    pg             Percent Good pings for u, v averaging after editing.
    pflag          Profile Flags based on editing, used to mask u, v.
    amp            Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
    =============  =================================================================
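
    For users of the netCDF short form, the flags can be applied programmatically. The sketch below assumes a local copy of the file and uses the variable names listed above; it simply treats any non-zero pflag as bad.

```
import numpy as np
from netCDF4 import Dataset

# Hypothetical local copy of the short-form netCDF file.
nc = Dataset("C16185_075_Line1040_0.nc")

u = nc.variables["u"][:]          # zonal velocity profiles
v = nc.variables["v"][:]          # meridional velocity profiles
pflag = nc.variables["pflag"][:]  # editing flags; 0 = good
                                  # (1 = bad bin, 2 = bad Percent Good, 4 = below bottom)

bad = pflag != 0
u_good = np.ma.masked_where(bad, u)
v_good = np.ma.masked_where(bad, v)
```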

    contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh75 description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75 Easternmost_Easting=-89.72401111111111 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-90.02310555555556 27.051475, -89.72401111111111 27.237783333333333) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.237783333333333 geospatial_lat_min=27.051475 geospatial_lat_units=degrees_north geospatial_lon_max=-89.72401111111111 geospatial_lon_min=-90.02310555555556 geospatial_lon_units=degrees_east geospatial_vertical_max=651.63 geospatial_vertical_min=27.63 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:47:26 UTC id=C16185_075_Line1040_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.237783333333333 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh75 source=Current profiler sourceUrl=(local files) Southernmost_Northing=27.051475 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT4H14M9S time_coverage_end=2006-05-21T12:09:36Z time_coverage_resolution=P0Y0M0DT0H5M0S time_coverage_start=2006-05-21T07:55:27Z Westernmost_Easting=-90.02310555555556 yearbase=2006

  9. Global maps of current (1979-2013) and future (2061-2080) habitat suitability probability for 1,485 European endemic plant species

    • data.niaid.nih.gov
    • research.science.eus
    • +3more
    zip
    Updated Jun 9, 2021
    + more versions
    Cite
    Robin Pouteau; Idoia Biurrun; Caroline Brunel; Milan Chytrý; Wayne Dawson; Franz Essl; Trevor Fristoe; Rense Haveman; Carsten Hobohm; Florian Jansen; Holger Kreft; Jonathan Lenoir; Bernd Lenzner; Carsten Meyer; Jesper Erenskjold Moeslund; Jan Pergl; Petr Pyšek; Jens-Christian Svenning; Wilfried Thuiller; Patrick Weigelt; Thomas Wohlgemuth; Qiang Yang; Mark van Kleunen (2021). Global maps of current (1979-2013) and future (2061-2080) habitat suitability probability for 1,485 European endemic plant species [Dataset]. http://doi.org/10.5061/dryad.qv9s4mwf3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 9, 2021
    Dataset provided by
    Research Institute for Developmenthttps://en.ird.fr/
    German Centre for Integrative Biodiversity Research (iDiv)https://www.idiv.de/
    Swiss Federal Institute for Forest, Snow and Landscape Research
    Aarhus University
    University of Göttingen
    Europa-Universität Flensburg
    Université de Picardie Jules Verne
    Université de Montpellier
    University of Rostock
    Masaryk University
    University of the Basque Country
    University of Vienna
    Durham University
    Université Grenoble Alpes
    University of Konstanz
    Czech Academy of Sciences
    Ministry of the Interior and Kingdom Relations
    Authors
    Robin Pouteau; Idoia Biurrun; Caroline Brunel; Milan Chytrý; Wayne Dawson; Franz Essl; Trevor Fristoe; Rense Haveman; Carsten Hobohm; Florian Jansen; Holger Kreft; Jonathan Lenoir; Bernd Lenzner; Carsten Meyer; Jesper Erenskjold Moeslund; Jan Pergl; Petr Pyšek; Jens-Christian Svenning; Wilfried Thuiller; Patrick Weigelt; Thomas Wohlgemuth; Qiang Yang; Mark van Kleunen
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Area covered
    Europe
    Description

    Aims: The rapid increase in the number of species that have naturalized beyond their native range is among the most apparent features of the Anthropocene. How alien species will respond to other processes of future global changes is an emerging concern and remains poorly understood. We therefore ask whether naturalized species will respond to climate and land-use change differently than those species not yet naturalized anywhere in the world.

    Location: Global

    Methods: We investigated future changes in the potential alien range of vascular plant species endemic to Europe that are either naturalized (n = 272) or not yet naturalized (n = 1,213) outside of Europe. Potential ranges were estimated based on projections of species distribution models using 20 future climate-change scenarios. We mapped current and future global centres of naturalization risk. We also analyzed expected changes in latitudinal, elevational and areal extent of species’ potential alien ranges.

    Results: We showed a large potential for more worldwide naturalizations of European plants currently and in the future. The centres of naturalization risk for naturalized and non-naturalized plants largely overlapped, and their location did not change much under projected future climates. Nevertheless, naturalized plants had their potential range shifting poleward over larger distances, whereas the non-naturalized ones had their potential elevational ranges shifting further upslope under the most severe climate change scenarios. As a result, climate and land-use changes are predicted to shrink the potential alien range of European plants, but less so for already naturalized than for non-naturalized species.

    Main conclusions: While currently non-naturalized plants originate frequently from mountain ranges or boreal and Mediterranean biomes in Europe, the naturalized ones usually occur at low elevations, close to human centres of activities. As the latter are expected to increase worldwide, this could explain why the potential alien range of already naturalized plants will shrink less.

    Methods Modelling the potential alien ranges of plant species under current climatic and land-use conditions

    Species selection

    We focused exclusively on vascular plant species endemic to Europe. Here, ‘Europe’ is used in a geographical sense and defined as bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west (the Macaronesian archipelagos were excluded), the Ural Mountains and the Caspian Sea to the east, and the Lesser Caucasus and the Mediterranean Sea to the south (Mediterranean islands included, Anatolia excluded).

    The most recent version of the database ‘Endemic vascular plants in Europe’ (EvaplantE; Hobohm, 2014), containing > 6,200 endemic plant taxa, was used here as a baseline for species selection. Scientific names were standardized based on The Plant List (http://www.theplantlist.org/). This taxonomic standardization was done with the R package ‘Taxonstand’ (Cayuela et al., 2017). Infraspecific taxa were excluded from the list, resulting in 4,985 species.

    Compilation of occurrence records

    To comprehensively compile the distribution of our studied set of endemic species in their native continent, we combined occurrence data in Europe from five sources. The first source was the ‘Global Biodiversity Information Facility’ (GBIF), one of the largest and most widely used biodiversity databases (https://www.gbif.org/). Currently, GBIF provides access to more than 600,000 distributional records for European endemic plant species. Records of European endemic plants deemed erroneous were discarded. All occurrences from GBIF were downloaded using the R package ‘rgbif’ (Chamberlain et al., 2019). The second source was the ‘EU-Forest’ dataset, providing information on European tree species distribution, including more than half a million occurrences at a 1 km (~ 50 arcsec at 50° latitude) resolution (Mauri et al., 2017). The third source we used was the ‘European Vegetation Archive’ (EVA), which assembles observations from more than one million vegetation plots across Europe (Chytrý et al., 2016). The fourth source was the digital version of the Atlas Florae Europaeae offering gridded maps. The fifth source was the ‘Plant Functional Diversity of Grasslands’ network (DIVGRASS), combining data on plant diversity across ~ 70,000 vegetation plots in French permanent grasslands (Violle et al., 2015).

    When several occurrences from these different sources were duplicated on the same 0.42° × 0.42° grid cell, only one record was kept to avoid pseudoreplication. After removing duplicate records, species with fewer than 10 occurrences were not further considered since the resulting SDM might be insufficiently accurate. The final dataset comprised 104,313 occurrences for 1,485 European endemic species.
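
    The grid-cell de-duplication and the 10-occurrence filter described above can be illustrated with a short pandas sketch (the file name and the column names species, lon and lat are assumptions):

```
import numpy as np
import pandas as pd

CELL = 0.42  # grid resolution in degrees

occ = pd.read_csv("occurrences.csv")  # hypothetical file with species, lon, lat columns

# Assign each record to a grid cell and keep one record per species per cell.
occ["cell_x"] = np.floor(occ["lon"] / CELL).astype(int)
occ["cell_y"] = np.floor(occ["lat"] / CELL).astype(int)
dedup = occ.drop_duplicates(subset=["species", "cell_x", "cell_y"])

# Drop species with fewer than 10 occurrences after de-duplication.
counts = dedup["species"].value_counts()
dedup = dedup[dedup["species"].isin(counts[counts >= 10].index)]
```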

    Environmental variables

    We selected six environmental predictors related to climate, soil physico-chemical properties and land use, commonly considered to shape the spatial distribution of plants (Gurevitch et al., 2006). Annual mean temperature (°C), annual sum of precipitation (mm) and precipitation seasonality representing the period 1979-2013 were extracted from the CHELSA climate model at a 30 arcsec resolution (Karger et al., 2017). Organic carbon content (g per kg) and soil pH in the first 15 cm of topsoil were extracted at a 1 km resolution from the global gridded soil information database SoilGrids (Hengl et al., 2014). The proportion of primary land-cover (land with natural vegetation that has not been subject to human activity since 1500) averaged over the period 1979-2013 in each 0.5° resolution grid cell (variable ‘gothr’) based on the Harmonized Global Land Use dataset was also used (Chini et al., 2014). Environmental variables were aggregated at a spatial resolution of 0.42° × 0.42° to approach the cell size of the occurrence records with the coarsest resolution (i.e. the Atlas Florae Europaeae).

    Species distribution modelling

    The potential distribution of 1,485 European endemic plant species was predicted by estimating environmental similarity to the sites of occurrence in Europe. To increase robustness of the predictions, we used six methods to generate species distribution models (SDMs): generalized additive models; generalized linear models; generalized boosting trees; maximum entropy; multivariate adaptive regression splines; and random forests. We evaluated the predictive performance of each SDM using a repeated split sampling approach in which SDMs were calibrated over 75% of the data and evaluated over the remaining 25%. This procedure was repeated 10 times. The evaluation was performed by measuring the area under the receiver operating characteristic (ROC) curve (AUC) and the true skill statistic (TSS). Continuous model predictions were transformed into binary ones by selecting the threshold maximizing TSS to ensure the most accurate predictions since it is based on both sensitivity and specificity.
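
    The threshold choice described above, maximizing TSS (sensitivity + specificity - 1), can be sketched generically as follows; this is an illustration rather than the biomod2 implementation:

```
import numpy as np

def max_tss_threshold(y_true, y_prob, thresholds=np.linspace(0.01, 0.99, 99)):
    """Return the probability threshold that maximizes TSS = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob)
    best_t, best_tss = 0.5, -1.0
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & y_true)
        tn = np.sum(~pred & ~y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        tss = sens + spec - 1.0
        if tss > best_tss:
            best_t, best_tss = t, tss
    return best_t, best_tss

# Example usage: threshold, tss = max_tss_threshold(y_validation, model_probabilities)
```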

    Results of the different SDM methods were aggregated into a single consensus projection (i.e. map) to reduce uncertainties associated with each technique. To ensure the quality of the ensemble SDMs, we only kept the projections for which the accuracy estimated by AUC and TSS were higher than 0.8 and 0.6, respectively, and assembled the selected SDMs using a committee-average approach with a weight proportional to their TSS evaluation. The entire species distribution modelling process was performed within the ‘biomod2’ R platform (Thuiller et al., 2009).

    Modelling the potential alien ranges of plant species under future climatic conditions

    To model the potential spread of the European endemic flora outside of Europe in the future (period 2061-2080), we used projections for the four representative concentration pathways (RCPs) of both climate and land cover data for the years 2061-2080. Due to substantial climatic differences predicted by different general circulation models (GCMs), which result in concomitant differences in species range projections, simulations of future climate variables were based on five different GCMs: CCSM4, CESM1-CAM5, CSIRO-mk3-6-0, IPSL-CM5A-LR and MIROC5.

    References

    Cayuela, L., Stein, A., & Oksanen, J. (2017). Taxonstand: taxonomic standardization of plant species names v.2.1. R Foundation for Statistical Computing. Available at https://cran.r-project.org/web/packages/Taxonstand/index.html.

    Chamberlain, S., Barve, V., Desmet, P., Geffert, L., Mcglinn, D., Oldoni, D., & Ram, K. (2019). rgbif: interface to the Global 'Biodiversity' Information Facility API v.1.3.0. R Foundation for Statistical Computing. Available at https://cran.r-project.org/web/packages/rgbif/index.html.

    Chini, L.P., Hurtt, G.C., & Frolking, S. (2014). Harmonized Global Land Use for Years 1500 – 2100, V1. Data set. Oak Ridge National Laboratory Distributed Active Archive Center, USA. Available at http://daac.ornl.gov

    Chytrý, M., Hennekens, S. M., Jiménez-Alfaro, B., Knollová, I., Dengler, J., Jansen, F., … Yamalov, S. (2016). European Vegetation Archive (EVA): an integrated database of European vegetation plots. Applied Vegetation Science, 19, 173–180.

    Hengl, T., de Jesus, J. M., MacMillan, R. A., Batjes, N. H., Heuvelink, G. B. M., Ribeiro, E., … Gonzalez, M. R. (2014). SoilGrids1km — Global Soil Information Based on Automated Mapping. PLoS ONE, 9, e105992.

    Hobohm, C. (Ed.) (2014). Endemism in Vascular Plants. [Plant and Vegetation 9]. Dordrecht, The Netherlands: Springer.

    Karger, D.N., Conrad, O., Böhner, J., Kawohl, T., Kreft, H., Soria-Auza, R.W., … Kessler, M. (2017). Climatologies at high resolution for the earth’s land surface areas. Scientific Data, 4, 170122.

    Mauri, A., Strona, G., & San-Miguel-Ayanz, J. (2017). EU-Forest, a high-resolution tree occurrence dataset for Europe. Scientific Data, 4,

  10. Data from: Using convolutional neural networks to efficiently extract immense phenological data from community science images

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 4, 2022
    Cite
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing (2022). Using convolutional neural networks to efficiently extract immense phenological data from community science images [Dataset]. http://doi.org/10.5061/dryad.mkkwh7123
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    University of Pittsburgh
    Carnegie Museum of Natural History
    Authors
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.

    Methods Creating a training and validation image set

    We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered "research-grade" if the observation is verifiable (includes image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and we randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images to approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).

    One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there were one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).

    For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants whose reproductive parts were clearly visible and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of the images from the four-stage dataset, leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).

    Training a two-stage and four-stage CNN

    We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained using a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the Pytorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it had fewer convolutional layers and thus was less computationally intensive than pre-trained neural networks with more layers. In early testing we reached desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 images for training (Deng et al. 2009). We utilized default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.

    Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with fully-connected layers containing an output size of 2 for the 2-class problem and 4 for the 4-class problem. We resized and cropped the images to fit ResNet’s input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss parameter that quantifies the model’s inaccuracy. The slope of the loss in relation to model parameters is found and then the model parameters are updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so that the validation data remains ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically for the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
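
    The transfer-learning setup described above can be sketched in PyTorch roughly as follows (this is not the authors' code; the resize/crop strategy and the normalization statistics are assumptions):

```
import torch.nn as nn
import torchvision
from torch.optim import SGD
from torchvision import transforms

NUM_CLASSES = 4  # 2 for the two-stage scheme

# Pre-trained ResNet-18 with the final fully-connected layer replaced so the
# output size matches the number of phenophase classes.
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Resize/crop to ResNet's 224x224 input and normalize each channel
# (the exact statistics used by the authors are not stated; these are placeholders).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.001)  # per the text: batch size 4, 25 epochs
```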

    We evaluated model accuracy and created confusion matrices using the model’s predictions on the labeled validation data. This allowed us to evaluate the model’s accuracy and which specific categories are the most difficult for the model to distinguish. For using the model to make phenology predictions on the full, 40,761 image dataset, we created a custom dataloader function in Pytorch using the Custom Dataset function, which would allow for loading images listed in a csv and passing them through the model associated with unique image IDs.
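
    A custom dataloader of the kind described, reading image paths and IDs from a CSV, might look like this sketch (the column names image_path and image_id are hypothetical):

```
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CsvImageDataset(Dataset):
    """Loads images listed in a CSV and returns (image_tensor, image_id)."""

    def __init__(self, csv_path, transform=None):
        self.frame = pd.read_csv(csv_path)   # hypothetical columns: image_path, image_id
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["image_id"]
```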

    Hardware information

    Model training was conducted using a personal laptop (Ryzen 5 3500U cpu and 8 GB of memory) and a desktop computer (Ryzen 5 3600 cpu, NVIDIA RTX 3070 GPU and 16 GB of memory).

    Comparing CNN accuracy to human annotation accuracy

    We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761 image dataset. An expert annotator (R. Reeb, who has over a year’s experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of different levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using Student’s t-tests.

  11. Reddit: /r/Tinder

    • kaggle.com
    zip
    Updated Dec 19, 2022
    Cite
    The Devastator (2022). Reddit: /r/Tinder [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-online-dating-trends-with-reddit-s-ti
    Explore at:
    zip(157055 bytes)Available download formats
    Dataset updated
    Dec 19, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/Tinder

    Examining User Behaviors and Attitudes

    By Reddit [source]

    About this dataset

    This dataset provides an in-depth exploration of the world of online dating, based on data mined from Reddit's Tinder subreddit. Through analysis of its six columns (title, score, id, url, comms_num and created), which capture social norms and user behaviors around online dating, it offers insights into how people engage with digital media and what attitudes they hold towards it. The data can also help surface potential dangers, such as safety risks and scams, that arise from online dating activities. Its findings are useful for anyone interested in how relationships develop on a digital platform, whether researchers studying the sociotechnical aspects of online dating behavior or companies seeking further insight into their users' perspectives.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides a comprehensive overview of online dating trends and behaviors observed on Reddit's Tinder subreddit. This data can be used to analyze user opinions, investigate user experiences, and discover online dating trends. To utilize this dataset effectively, there are several steps an individual can take to gain insights from the data; the research ideas below are a starting point.

    Research Ideas

    • Using the dataset to examine how online dating trends vary geographically and by demographics (gender, age, race etc.)
    • Analyzing the language used in posts for insights into user attitudes towards online dating.
    • Creating a machine learning model to predict a post's score based on its title, body and other features of the dataset, which can help digital media companies better target their marketing efforts towards more successful posts on the Tinder subreddit (see the sketch below).
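
    As a rough baseline for the score-prediction idea above (a sketch, not part of the dataset; it assumes the file Tinder.csv and the column names from the table below), one could vectorize post titles and fit a simple regressor:

```
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("Tinder.csv").dropna(subset=["title", "score"])

X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["score"], test_size=0.2, random_state=0)

# TF-IDF over titles plus ridge regression as a simple baseline.
model = make_pipeline(TfidfVectorizer(min_df=2), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("R^2 on held-out posts:", model.score(X_test, y_test))
```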

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Tinder.csv

    | Column name | Description |
    |:------------|:---------------------------------------------------------|
    | title       | The title of the post. (String)                           |
    | score       | The number of upvotes the post has received. (Integer)    |
    | url         | The URL of the post. (String)                             |
    | comms_num   | The number of comments the post has received. (Integer)   |
    | created     | The date and time the post was created. (DateTime)        |
    | body        | The body of the post. (String)                            |
    | timestamp   | The timestamp of the post. (Integer)                      |
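
    For working with the time fields, a small sketch (the table lists timestamp as an integer; treating it as Unix epoch seconds is an assumption that should be verified against the data):

```
import pandas as pd

df = pd.read_csv("Tinder.csv")

# Assuming the integer timestamp is Unix epoch seconds (verify against the data).
df["created_dt"] = pd.to_datetime(df["timestamp"], unit="s", errors="coerce")

# Posts per month, as a quick sanity check of coverage.
print(df.set_index("created_dt").resample("M")["score"].count().head())
```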

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  12. Fugro Cruise C16185 Line 1388, 75 kHz VMADCP

    • gcoos5.geos.tamu.edu
    • data.ioos.us
    • +3more
    Updated Sep 21, 2017
    Cite
    Rosemary Smith (2017). Fugro Cruise C16185 Line 1388, 75 kHz VMADCP [Dataset]. https://gcoos5.geos.tamu.edu/erddap/info/C16185_075_Line1388_0/index.html
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 21, 2017
    Dataset provided by
    Gulf of Mexico Coastal Ocean Observing System (GCOOS)
    Authors
    Rosemary Smith
    Time period covered
    Aug 21, 2006 - Aug 22, 2006
    Area covered
    Variables measured
    u, v, pg, amp, time, depth, pflag, uship, vship, heading, and 5 more
    Description

    Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=

    CODAS processing note:

    Overview

    The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.

    Post-processing

    CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.

    Full CODAS processing

    Whenever single-ping data have been recorded, full CODAS processing provides the best end product.

    Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.

    Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".

    note (time):

    Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.

    note (heading):

    CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:

    X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
    G  = Reliable heading (gyrocompass)
    A  = Accurate heading
    dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)

    Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.

    theta = X + (G - dh) = X + G - dh

    Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply, for example if mean and median of the phase are all 0.5 (then R=0.5). Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".

    new_dh = dh + R

    Therefore the total angle used in rotation is

    new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)

    The new estimate of the transducer angle is: X - R ANCIL2_watrk_hd_misalign contains: dh + R

    ====================================================

    Profile flags

    Profile editing flags are provided for each depth cell:

    binary value | decimal value | below bottom | Percent Good | bin
    -------------+---------------+--------------+--------------+-----
    000          | 0             |              |              |
    001          | 1             |              |              | bad
    010          | 2             |              | bad          |
    011          | 3             |              | bad          | bad
    100          | 4             | bad          |              |
    101          | 5             | bad          |              | bad
    110          | 6             | bad          | bad          |
    111          | 7             | bad          | bad          | bad

    CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.

    =============  =================================================================
    time           Time at the end of the ensemble, days from start of year.
    lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
    u, v           Ocean zonal and meridional velocity component profiles.
    uship, vship   Zonal and meridional velocity components of the ship.
    heading        Mean ship heading during the ensemble.
    depth          Bin centers in nominal meters (no sound speed profile correction).
    tr_temp        ADCP transducer temperature.
    pg             Percent Good pings for u, v averaging after editing.
    pflag          Profile Flags based on editing, used to mask u, v.
    amp            Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
    =============  =================================================================

    contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh75 description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75 Easternmost_Easting=-91.18131944444445 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-91.47516388888891 26.999433333333332, -91.18131944444445 27.000597222222222) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.000597222222222 geospatial_lat_min=26.999433333333332 geospatial_lat_units=degrees_north geospatial_lon_max=-91.18131944444445 geospatial_lon_min=-91.47516388888891 geospatial_lon_units=degrees_east geospatial_vertical_max=600.76 geospatial_vertical_min=16.76 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:48:18 UTC id=C16185_075_Line1388_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.000597222222222 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh75 source=Current profiler sourceUrl=(local files) Southernmost_Northing=26.999433333333332 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT3H34M46S time_coverage_end=2006-08-22T01:10:58Z time_coverage_resolution=P0Y0M0DT0H4M59S time_coverage_start=2006-08-21T21:36:12Z Westernmost_Easting=-91.47516388888891 yearbase=2006

  13. Data from: arXiv Dataset

    • kaggle.com
    • huggingface.co
    • +1more
    zip
    Updated Nov 22, 2020
    + more versions
    Cite
    Cornell University (2020). arXiv Dataset [Dataset]. https://www.kaggle.com/Cornell-University/arxiv
    Explore at:
    zip(950178574 bytes)Available download formats
    Dataset updated
    Nov 22, 2020
    Dataset authored and provided by
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.


    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

    - id: ArXiv ID (can be used to access the paper, see below)
    - submitter: Who submitted the paper
    - authors: Authors of the paper
    - title: Title of the paper
    - comments: Additional info, such as number of pages and figures
    - journal-ref: Information about the journal the paper was published in
    - doi: https://www.doi.org
    - abstract: The abstract of the paper
    - categories: Categories / tags in the ArXiv system
    - versions: A version history
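
    As a hedged sketch of reading the metadata in Python (the file name below is an assumption, and the file is treated as one JSON record per line, which should be confirmed against the actual download):

```
import json
from itertools import islice

# Assumed file name for the metadata download; one JSON record per line.
with open("arxiv-metadata-oai-snapshot.json", "r") as f:
    for line in islice(f, 5):          # peek at the first few records
        paper = json.loads(line)
        print(paper["id"], "-", paper["title"].strip()[:80])
```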

    You can access each paper directly on ArXiv using these links:

    - https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
    - https://arxiv.org/pdf/{id}: Direct link to download the PDF

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use, for example, gsutil to download the data to your local machine:

```
# List files:
gsutil ls gs://arxiv-dataset/arxiv/

# Download PDFs from March 2020:
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

# Download all the source files:
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.

  14. Fugro Cruise C16185 Line 1388, 300 kHz VMADCP

    • gcoos5.geos.tamu.edu
    • data.ioos.us
    • +1more
    Updated Sep 21, 2017
    Cite
    Rosemary Smith (2017). Fugro Cruise C16185 Line 1388, 300 kHz VMADCP [Dataset]. https://gcoos5.geos.tamu.edu/erddap/info/C16185_300_Line1388_0/index.html
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 21, 2017
    Dataset provided by
    Gulf of Mexico Coastal Ocean Observing System (GCOOS)
    Authors
    Rosemary Smith
    Time period covered
    Aug 21, 2006 - Aug 22, 2006
    Area covered
    Variables measured
    u, v, pg, amp, time, depth, pflag, uship, vship, heading, and 5 more
    Description

    Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=

    CODAS processing note:

    Overview

    The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.

    Post-processing

    CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.

    Full CODAS processing

    Whenever single-ping data have been recorded, full CODAS processing provides the best end product.

    Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.

    Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".

    note (time):

    Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.

    note (heading):

    CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:

    X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
    G  = Reliable heading (gyrocompass)
    A  = Accurate heading
    dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)

    Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.

    theta = X + (G - dh) = X + G - dh

    Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply, for example if mean and median of the phase are all 0.5 (then R=0.5). Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".

    new_dh = dh + R

    Therefore the total angle used in rotation is

    new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)

    The new estimate of the transducer angle is: X - R ANCIL2_watrk_hd_misalign contains: dh + R

    ====================================================

    Profile flags

    Profile editing flags are provided for each depth cell:

    binary value | decimal value | below bottom | Percent Good | bin
    -------------+---------------+--------------+--------------+-----
    000          | 0             |              |              |
    001          | 1             |              |              | bad
    010          | 2             |              | bad          |
    011          | 3             |              | bad          | bad
    100          | 4             | bad          |              |
    101          | 5             | bad          |              | bad
    110          | 6             | bad          | bad          |
    111          | 7             | bad          | bad          | bad

    CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.

    =============  =================================================================
    time           Time at the end of the ensemble, days from start of year.
    lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
    u, v           Ocean zonal and meridional velocity component profiles.
    uship, vship   Zonal and meridional velocity components of the ship.
    heading        Mean ship heading during the ensemble.
    depth          Bin centers in nominal meters (no sound speed profile correction).
    tr_temp        ADCP transducer temperature.
    pg             Percent Good pings for u, v averaging after editing.
    pflag          Profile Flags based on editing, used to mask u, v.
    amp            Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
    =============  =================================================================

    contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh300 description=Shipboard ADCP velocity profiles from Fugro_wh300 using instrument wh300 Easternmost_Easting=-91.18141111111112 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-91.47513888888886 26.999433333333332, -91.18141111111112 27.0006) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.0006 geospatial_lat_min=26.999433333333332 geospatial_lat_units=degrees_north geospatial_lon_max=-91.18141111111112 geospatial_lon_min=-91.47513888888886 geospatial_lon_units=degrees_east geospatial_vertical_max=123.42 geospatial_vertical_min=7.42 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:45:44 UTC id=C16185_300_Line1388_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.0006 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh300 source=Current profiler sourceUrl=(local files) Southernmost_Northing=26.999433333333332 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT3H34M42S time_coverage_end=2006-08-22T01:10:57Z time_coverage_resolution=P0Y0M0DT0H5M0S time_coverage_start=2006-08-21T21:36:15Z Westernmost_Easting=-91.47513888888886 yearbase=2006

  15. MIT-BIH Arrhythmia Database (Simple CSVs)

    • kaggle.com
    zip
    Updated Jul 20, 2025
    Cite
    Proto Bioengineering (2025). MIT-BIH Arrhythmia Database (Simple CSVs) [Dataset]. https://www.kaggle.com/datasets/protobioengineering/mit-bih-arrhythmia-database-modern-2023
    Explore at:
    zip(241764502 bytes)Available download formats
    Dataset updated
    Jul 20, 2025
    Authors
    Proto Bioengineering
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A beginner-friendly version of the MIT-BIH Arrhythmia Database, which contains 48 electrocardiograms (EKGs) from 47 patients who were at Beth Israel Deaconess Medical Center in Boston, MA in 1975-1979.

    Update (7/18/2025)

    This data was updated to a new format on 7/18/2025 with new filenames. Now heartbeats are labeled and their annotations are in new CSV and JSON files. This means that each patient's EKG file is now named {id}_ekg.csv and they have accompanying heartbeat annotation files, named {id}_annotations.csv. For example, if your code used to open 100.csv, it should be changed to opening 100_ekg.csv.

    Filenames

    Each of the 48 EKGs has the following files (using patient 100 as an example):

    - 100_ekg.csv - a 30-minute EKG recording from one patient with 2 EKG channels. This also contains annotations (the symbol column), where doctors have marked and classified heartbeats as normal or abnormal.
    - 100_ekg.json - the 30-minute EKG with all of its metadata. It has all of the same data as the CSV file in addition to frequency/sample rate info and more.
    - 100_annotations.csv - the labels for the heartbeats, where doctors have manually classified each heartbeat as normal or as one of dozens of types of arrhythmias. There may be multiple of these files (numbered 1, 2, or 3), since the original MIT-BIH Arrhythmia Database had multiple .atr files for some patients. The MIT-BIH DB did not elaborate on why, though the differences between the annotation files seem to be only a few lines at most.
    - 100_annotations.json - the annotation file that is as close to the original as possible, keeping all of its metadata, while being an easy-to-use JSON file (as opposed to an .atr file, which requires the WFDB library to open).

    Other files:

    • annotation_symbols.csv - contains the meanings of the annotation symbols

    There are 48 EKGs for 47 patients, each of which is a 30-minute electrocardiogram (EKG) from a single patient. (Records 201 and 202 are from the same patient.) Data was collected at 360 Hz, meaning that 360 data points correspond to 1 second of time.

    Each file's name starts with the record ID of the patient (records 201 and 202 are different records from the same person).
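
    As a quick illustration of the layout above, here is a minimal Python/pandas sketch that loads one record and converts sample indices to seconds using the 360 Hz sample rate. It is only a sketch: the exact lead column names and the presence of a symbol column in the annotation file are assumptions, so inspect the files before relying on them.

      import pandas as pd

      SAMPLE_RATE_HZ = 360  # per the description: 360 data points = 1 second

      # Filenames follow the post-7/18/2025 scheme: {id}_ekg.csv and {id}_annotations.csv
      ekg = pd.read_csv("100_ekg.csv")
      ann = pd.read_csv("100_annotations.csv")

      print(ekg.columns.tolist())                    # check the actual column names first
      ekg["seconds"] = ekg.index / SAMPLE_RATE_HZ    # sample index -> elapsed time in seconds

      # Count annotated beat types (assumes the annotation file exposes a 'symbol' column,
      # mirroring the 'symbol' column described for the EKG CSV)
      if "symbol" in ann.columns:
          print(ann["symbol"].value_counts())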

    Related Data

    The P-waves were labeled by doctors and technicians, and their exact indices are available in the accompanying dataset, MIT-BIH Arrhythmia Database P-wave Annotations.

    How to Analyze the Heart with Python

    1. How to Analyze Heartbeats in 15 Minutes with Python
    2. How the Heart Works (and What is a "QRS" Complex?)
    3. How to Identify and Label the Waves of an EKG
    4. How to Flatten a Wandering EKG
    5. How to Calculate the Heart Rate

    What is a 12-lead EKG?

    EKGs, or electrocardiograms, measure the heart's function by looking at its electrical activity. The electrical activity in each part of the heart is supposed to happen in a particular order and intensity, creating that classic "heartbeat" line (or "QRS complex") you see on monitors in medical TV shows.

    There are a few types of EKGs (4-lead, 5-lead, 12-lead, etc.), which give us varying detail about the heart. A 12-lead is one of the most detailed types of EKGs, as it allows us to get 12 different outputs or graphs, all looking at different, specific parts of the heart muscles.

    This dataset only publishes two leads from each patient's 12-lead EKG, since that is all that the original MIT-BIH database provided.

    What does each part of the QRS complex mean?

    Check out Ninja Nerd's EKG Basics tutorial on YouTube to understand what each part of the QRS complex (or heartbeat) means from an electrical standpoint.

    Columns

    • index
    • the first lead
    • the second lead

    The two leads are usually lead MLII plus another lead such as V1, V2, or V5, though a few records do not include MLII at all. MLII is the lead most often associated with the classic QRS complex (the medical name for a single heartbeat).

    Patient information

    Info about [each of the 47 patients is available here](https://physionet.org/phys...

  16. Mexico COVID-19 clinical data

    • kaggle.com
    zip
    Updated Jun 5, 2020
    Cite
    Mariana R Franklin (2020). Mexico COVID-19 clinical data [Dataset]. https://www.kaggle.com/datasets/marianarfranklin/mexico-covid19-clinical-data/code
    Explore at:
    zip (6399963 bytes)
    Dataset updated
    Jun 5, 2020
    Authors
    Mariana R Franklin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Mexico
    Description

    Mexico COVID-19 clinical data 🦠🇲🇽

    This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127).

    The official, raw dataset is available on the official Secretary of Epidemiology website: https://www.gob.mx/salud/documentos/datos-abiertos-152127.

    You might also want to download the official column descriptors and the variable definitions (e.g. SEXO=1 -> Female; SEXO=2 -> Male; SEXO=99 -> Undisclosed) from the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've maintained the original levels as described in the official dataset, unless otherwise specified.

    IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but consisted of PDF files, often with many inconsistencies that had to be resolved carefully; those fixes are annotated in the .R script. Later datasets should be more reliable, but the early releases required figuring out a number of issues, e.g. when the official methodology for assigning the region of a case was changed to be based on residence rather than origin. I've added more notes on very early data here: https://github.com/marianarf/covid19_mexico_data.

    [More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).

    Motivation

    I hope that this data serves as a base for understanding the clinical symptoms 🔬 that distinguish a COVID-19 positive case from other viral respiratory diseases, and that it helps expand knowledge about COVID-19 worldwide.

    👩‍🔬🧑‍🔬🧪

    With more models tested, added features and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result in two scenarios:

    • As lab results are processed, there is a window during which it is uncertain whether a result will come back positive or negative (this is merely didactic, since new reports corroborate the prediction as soon as the laboratory data for missing cases is reported).
    • More importantly, it could help predict from similar symptoms, e.g. collected via a survey or an app, ideally using mostly parameters that can be assessed without hospitalization (such as the age of the person, which is readily available).

    The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded as 1 = POSITIVE and 2 = NEGATIVE.
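
    To make the encoding concrete, here is a minimal pandas sketch that decodes RESULTADO and SEXO using the mappings quoted above. The CSV path is the repository file referenced in the Preprocess section and may change, and the read options are assumptions, so treat this as a sketch rather than the maintainer's pipeline.

      import pandas as pd

      # Processed CSV referenced in the Preprocess section (path may change over time)
      URL = "https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv"
      df = pd.read_csv(URL, low_memory=False)

      # Mappings taken from the description; the full catalog is in the official dictionary zip
      resultado_map = {1: "POSITIVE", 2: "NEGATIVE"}
      sexo_map = {1: "FEMALE", 2: "MALE", 99: "UNDISCLOSED"}

      if "RESULTADO" in df.columns:
          df["resultado_label"] = df["RESULTADO"].map(resultado_map)
          print(df["resultado_label"].value_counts(dropna=False))
      if "SEXO" in df.columns:
          df["sexo_label"] = df["SEXO"].map(sexo_map)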

    Source

    The data was gathered using a "sentinel model" that samples 10% of the patients who present with a viral respiratory diagnosis to test for COVID-19, and consists of data reported by 475 viral respiratory disease monitoring units (hospitals) named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral) throughout the country and across the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).

    Preprocess

    Data is first processed with [this .R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data is updated daily. Important: since the data is updated on GitHub, assume the data uploaded here isn't the latest version; instead, load the data directly from the csv [in this github repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).

    • The data aggregates official daily reports of patients admitted in COVID-19 designated units.
    • New cases are usually concatenated at the end of the file, but each individual case also contains a unique (official) identifier 'ID_REGISTRO' as well as a (new) unique reference 'id' to remove duplicates.
    • I fixed a specific change in methodology in reporting, where the patient record used to be assigned in ENTIDAD_UM (the region of the medical unit) but now uses ENTIDAD_RES (the region of residence of the patient).
    Note: I have preserved the original structure (column names and factors) as closely as possible to the official data, so that code is reproducible and can be cross-referenced to the official sources.

    Added features

    In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from previous days, this makes it possible to keep track of that lag).

    Additional info

    ...

  17. Fugro Cruise C16185 Line 1054, 75 kHz VMADCP

    • gcoos5.geos.tamu.edu
    • data.ioos.us
    • +2more
    Updated Sep 21, 2017
    Cite
    Rosemary Smith (2017). Fugro Cruise C16185 Line 1054, 75 kHz VMADCP [Dataset]. https://gcoos5.geos.tamu.edu/erddap/info/C16185_075_Line1054_0/index.html
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2017
    Dataset provided by
    Gulf of Mexico Coastal Ocean Observing System (GCOOS)
    Authors
    Rosemary Smith
    Time period covered
    May 25, 2006
    Area covered
    Variables measured
    u, v, pg, amp, time, depth, pflag, uship, vship, heading, and 5 more
    Description

    Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=

    CODAS processing note:

    Overview

    The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.

    Post-processing

    CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.

    Full CODAS processing

    Whenever single-ping data have been recorded, full CODAS processing provides the best end product.

    Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.

    Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".

    note (time):

    Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.

    note (heading):

    CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:

    X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
    G  = reliable heading (gyrocompass)
    A  = accurate heading
    dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)

    Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.

    theta = X + (G - dh) = X + G - dh

    Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply; for example, if the mean and median of the phase are both 0.5, then R = 0.5. Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".

    new_dh = dh + R

    Therefore the total angle used in rotation is

    new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)

    The new estimate of the transducer angle is: X - R
    ANCIL2_watrk_hd_misalign contains: dh + R
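
    The rotation above can be written directly with complex numbers. The NumPy sketch below is only an illustration of the stated formula (u + i*v) * exp(i*theta) with theta = X + G - (dh + R); the numeric values are made up, and this is not the pycurrents implementation.

      import numpy as np

      # Example values in degrees; real values come from the CODAS database and calibration
      X = 45.2      # transducer angle (CONFIG1_heading_bias)
      G = 123.0     # reliable (gyro) heading
      dh = 0.8      # time-dependent heading correction (ANCIL2_watrk_hd_misalign)
      R = 0.5       # residual angle from watertrack/bottomtrack calibration

      # Measured velocity components in ship coordinates (m/s), two depth bins as an example
      u_meas = np.array([0.10, 0.12])
      v_meas = np.array([0.40, 0.38])

      theta = np.deg2rad(X + G - (dh + R))              # total rotation angle, new_theta above
      rotated = (u_meas + 1j * v_meas) * np.exp(1j * theta)
      u, v = rotated.real, rotated.imag                 # velocity components after rotation
      print(u, v)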

    ====================================================

    Profile flags

    Profile editing flags are provided for each depth cell:

    binary    decimal   below     Percent
    value     value     bottom    Good      bin
    -------+----------+--------+----------+-------+
     000       0
     001       1                            bad
     010       2                   bad
     011       3                   bad      bad
     100       4         bad
     101       5         bad                bad
     110       6         bad       bad
     111       7         bad       bad      bad
    -------+----------+--------+----------+-------+
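
    Since pflag packs the three conditions above into a 3-bit mask, a value can be decoded with simple bitwise tests. The helper below is only an illustration of the encoding as read from the table, not part of the CODAS tools.

      def decode_pflag(value: int) -> dict:
          """Decode a 3-bit profile flag into the conditions shown in the table above."""
          return {
              "bin_bad": bool(value & 0b001),           # bit 0: bad bin
              "percent_good_bad": bool(value & 0b010),  # bit 1: low Percent Good
              "below_bottom": bool(value & 0b100),      # bit 2: below the bottom
          }

      print(decode_pflag(5))  # below_bottom and bin_bad, matching the table row for decimal 5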

    CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.

    =============  =================================================================
    time           Time at the end of the ensemble, days from start of year.
    lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
    u, v           Ocean zonal and meridional velocity component profiles.
    uship, vship   Zonal and meridional velocity components of the ship.
    heading        Mean ship heading during the ensemble.
    depth          Bin centers in nominal meters (no sound speed profile correction).
    tr_temp        ADCP transducer temperature.
    pg             Percent Good pings for u, v averaging after editing.
    pflag          Profile Flags based on editing, used to mask u, v.
    amp            Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
    =============  =================================================================

    contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh75 description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75 Easternmost_Easting=-89.82341944444443 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-89.94420555555558 27.253375, -89.82341944444443 27.25493888888889) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.25493888888889 geospatial_lat_min=27.253375 geospatial_lat_units=degrees_north geospatial_lon_max=-89.82341944444443 geospatial_lon_min=-89.94420555555558 geospatial_lon_units=degrees_east geospatial_vertical_max=651.83 geospatial_vertical_min=27.83 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:47:36 UTC id=C16185_075_Line1054_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.25493888888889 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh75 source=Current profiler sourceUrl=(local files) Southernmost_Northing=27.253375 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT1H16M8S time_coverage_end=2006-05-25T12:05:45Z time_coverage_resolution=P0Y0M0DT0H4M59S time_coverage_start=2006-05-25T10:49:37Z Westernmost_Easting=-89.94420555555558 yearbase=2006

  18. Telco Customer Churn

    • kaggle.com
    zip
    Updated Feb 23, 2018
    + more versions
    Cite
    BlastChar (2018). Telco Customer Churn [Dataset]. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
    Explore at:
    zip (175758 bytes)
    Dataset updated
    Feb 23, 2018
    Authors
    BlastChar
    Description

    Context

    "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

    Content

    Each row represents a customer; each column contains a customer attribute, described in the column Metadata.

    The data set includes information about the following (a brief loading sketch follows the list):

    • Customers who left within the last month – the column is called Churn
    • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
    • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
    • Demographic info about customers – gender, age range, and if they have partners and dependents
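
    As flagged above, here is a brief pandas sketch of how these columns are typically used. The filename is a placeholder and the exact column spellings (Churn, Contract) and the Yes/No encoding are assumptions based on this description, so verify them against the downloaded CSV.

      import pandas as pd

      # Placeholder filename; use the CSV name as downloaded from Kaggle
      df = pd.read_csv("telco_customer_churn.csv")

      # Churn marks customers who left within the last month, per the description
      churn_rate = (df["Churn"] == "Yes").mean()        # assumes a Yes/No encoding
      print(f"Overall churn rate: {churn_rate:.1%}")

      # Example slice: churn rate by contract type (column name assumed to be 'Contract')
      if "Contract" in df.columns:
          print(df.groupby("Contract")["Churn"].apply(lambda s: (s == "Yes").mean()))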

    Inspiration

    To explore this type of model and learn more about the subject.

    New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

  19. Fugro Cruise C16185 Line 1292, 300 kHz VMADCP

    • gcoos5.geos.tamu.edu
    • data.ioos.us
    • +2more
    Updated Sep 21, 2017
    Cite
    Rosemary Smith (2017). Fugro Cruise C16185 Line 1292, 300 kHz VMADCP [Dataset]. https://gcoos5.geos.tamu.edu/erddap/info/C16185_300_Line1292_0/index.html
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2017
    Dataset provided by
    Gulf of Mexico Coastal Ocean Observing System (GCOOS)
    Authors
    Rosemary Smith
    Time period covered
    Jul 15, 2006
    Area covered
    Variables measured
    u, v, pg, amp, time, depth, pflag, uship, vship, heading, and 5 more
    Description

    Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=

    CODAS processing note:

    Overview

    The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.

    Post-processing

    CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.

    Full CODAS processing

    Whenever single-ping data have been recorded, full CODAS processing provides the best end product.

    Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.

    Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".

    note (time):

    Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.

    note (heading):

    CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:

    X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
    G  = reliable heading (gyrocompass)
    A  = accurate heading
    dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)

    Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.

    theta = X + (G - dh) = X + G - dh

    Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply; for example, if the mean and median of the phase are both 0.5, then R = 0.5. Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".

    new_dh = dh + R

    Therefore the total angle used in rotation is

    new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)

    The new estimate of the transducer angle is: X - R
    ANCIL2_watrk_hd_misalign contains: dh + R

    ====================================================

    Profile flags

    Profile editing flags are provided for each depth cell:

    binary    decimal   below     Percent
    value     value     bottom    Good      bin
    -------+----------+--------+----------+-------+
     000       0
     001       1                            bad
     010       2                   bad
     011       3                   bad      bad
     100       4         bad
     101       5         bad                bad
     110       6         bad       bad
     111       7         bad       bad      bad
    -------+----------+--------+----------+-------+

    CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.

    =============  =================================================================
    time           Time at the end of the ensemble, days from start of year.
    lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
    u, v           Ocean zonal and meridional velocity component profiles.
    uship, vship   Zonal and meridional velocity components of the ship.
    heading        Mean ship heading during the ensemble.
    depth          Bin centers in nominal meters (no sound speed profile correction).
    tr_temp        ADCP transducer temperature.
    pg             Percent Good pings for u, v averaging after editing.
    pflag          Profile Flags based on editing, used to mask u, v.
    amp            Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
    =============  =================================================================

    contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh300 description=Shipboard ADCP velocity profiles from Fugro_wh300 using instrument wh300 Easternmost_Easting=-90.05254444444444 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-90.64675833333331 26.964925, -90.05254444444444 27.216625) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.216625 geospatial_lat_min=26.964925 geospatial_lat_units=degrees_north geospatial_lon_max=-90.05254444444444 geospatial_lon_min=-90.64675833333331 geospatial_lon_units=degrees_east geospatial_vertical_max=123.42 geospatial_vertical_min=7.42 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:46:18 UTC id=C16185_300_Line1292_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.216625 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh300 source=Current profiler sourceUrl=(local files) Southernmost_Northing=26.964925 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT7H10M17S time_coverage_end=2006-07-15T19:33:39Z time_coverage_resolution=P0Y0M0DT0H4M59S time_coverage_start=2006-07-15T12:23:22Z Westernmost_Easting=-90.64675833333331 yearbase=2006

  20. Synthetic Financial Datasets For Fraud Detection

    • kaggle.com
    zip
    Updated Apr 3, 2017
    Cite
    Edgar Lopez-Rojas (2017). Synthetic Financial Datasets For Fraud Detection [Dataset]. https://www.kaggle.com/datasets/ealaxi/paysim1
    Explore at:
    zip (186385561 bytes)
    Dataset updated
    Apr 3, 2017
    Authors
    Edgar Lopez-Rojas
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

    We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

    Content

    PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that is the provider of the mobile financial service, which is currently running in more than 14 countries around the world.

    This synthetic dataset is scaled down to 1/4 of the original dataset, and it was created just for Kaggle.

    NOTE: Transactions which are detected as fraud are cancelled, so for fraud detection these columns (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest ) must not be used.

    Headers

    This is a sample of 1 row with headers explanation:

    1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

    step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

    type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    amount - amount of the transaction in local currency.

    nameOrig - customer who started the transaction

    oldbalanceOrg - initial balance before the transaction

    newbalanceOrig - new balance after the transaction.

    nameDest - customer who is the recipient of the transaction

    oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

    newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

    isFraud - transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent behavior of the agents aims to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

    isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
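
    Putting the header description together, the sketch below parses the quoted sample row into the eleven named columns and shows, in comments, the kind of filtering the description suggests (fraud occurs via TRANSFER followed by cash-out, and isFlaggedFraud marks transfers above 200,000). The full-dataset filename is a placeholder, and the exact spelling of the type values should be checked against the data.

      import io
      import pandas as pd

      COLUMNS = ["step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
                 "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud"]

      # Parse the single sample row quoted above
      sample = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"
      row = pd.read_csv(io.StringIO(sample), names=COLUMNS)
      print(row.T)

      # With the full dataset (placeholder filename), the same columns apply:
      # full = pd.read_csv("paysim.csv")
      # print(full["type"].unique())                      # confirm the exact type spellings
      # flagged = full[(full["type"] == "TRANSFER") & (full["amount"] > 200_000)]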

    Past Research

    There are 5 similar files that contain the runs of 5 different scenarios. These files are better explained in chapter 7 of my PhD thesis (available here: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

    We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an i7 Intel processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 types of categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    Acknowledgements

    This work is part of the research project ”Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

    Please refer to this dataset using the following citations:

    PaySim first paper of the simulator:

    E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016
