By Reddit [source]
This dataset, Reddit Technology Data, provides insight into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains the titles of discussions, the scores contributed by Reddit users, the unique IDs of the discussions, the URLs linked from those discussions (if any), the comment counts of each discussion thread, and timestamps indicating when the conversations were started. This makes the data valuable for tech-savvy readers who want to stay up to date with new developments in their field and for professionals keeping abreast of industry trends. In short, it is a repository that helps people make sense of what is happening in the technology world at large, whether that inspires action on their part or simply educates them about forthcoming changes.
The dataset includes columns for the title, the score, the URL linking to the discussion page on Reddit, the comment count, the creation timestamp (when the post was made), and the body containing the actual text of the post. By analyzing each column separately, you can work out what kind of information users look for across different aspects of technology-related discussions. You can also form hypotheses about correlations between factors such as score and comment count: for example, which types of posts do people comment on or react to most, and do high scores always come with very long comment threads? Exploring the data in this way can reveal patterns hidden behind a social platform like Reddit, which holds a large amount of rich information about users' interests in tech gadgets and other technology topics. Analyzing these trends helps visualize users' overall reactions to information shared through public forums, and such insights can support research as well as potential business opportunities if the data is monitored over time.
- Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
- Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
- Technology professionals can use the comments and discussions in this dataset to gauge public opinion and consumer sentiment towards certain technological advancements or products.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: technology.csv

| Column name | Description |
|:------------|:------------|
| title       | The title of the discussion. (String) |
| score       | The score of the discussion as measured by Reddit contributors. (Integer) |
| url         | The website URL associated with the discussion. (String) |
| comms_num   | The number of comments associated with the discussion. (Integer) |
| created     | The date and time the discussion was created. (DateTime) |
| body        | The body content of the discussion. (String) |
| timestamp   | The timestamp of the discussion. (Integer) |
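As a quick illustration of the column-level analysis described above, here is a minimal pandas sketch. It assumes technology.csv is available locally and uses only the column names listed in the table above.

import pandas as pd

# Load the Reddit technology discussions.
df = pd.read_csv("technology.csv")

# How strongly does the score track the number of comments?
print(df[["score", "comms_num"]].corr())

# Do longer bodies attract more comments? (body may be empty for link posts)
df["body_len"] = df["body"].fillna("").str.len()
print(df[["body_len", "comms_num"]].corr())

# Top 10 highest-scoring discussion titles.
print(df.nlargest(10, "score")[["title", "score", "comms_num"]])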
If you use this dataset in your research, please credit Reddit.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
League of Legends is a popular global online game played by millions of players monthly. In the past few years, the League of Legends e-sports industry has shown phenomenal growth. Just recently in 2020, the World Championship finals drew 3.8 million peak viewers! While the e-sports industry still lags behind traditional sports in terms of popularity and viewership, it has shown exponential growth in certain regions with fast-growing economies, such as Vietnam and China, making it a prime target for sponsorship for foreign companies looking to spread brand awareness in these regions.
While the e-sports data industry is also showing gradual growth, there is not much available publicly in terms of published analysis of individual games. This may be due to the fact that the games are fast-changing compared to traditional sports--rules and game stats are frequently and arbitrarily changed by the developers. Nevertheless, it is an interesting field for fun research, which is why many pet projects and graduate-level papers are dedicated to it.
All existing League of Legends games (minus custom games, including ones from competitions) are made available by Riot's API. However, having to request and parse the data for every single relevant game is quite tedious; this dataset intends to save you that work. To make things (hopefully) easier, I parsed all JSON files returned by the Riot API into CSV files, with each row corresponding to one game.
This dataset consists of three parts: root games, root2tail, and tail games.
I found that, when trying to predict the outcome of a match before it is played, a player's match history prior to that game is often an important factor (Hall, 2017). For that purpose, root games contains 1087 games from which tail games branches out.
Tail games contains the historical matches of each player for every game in root games. Root2tail maps each root-game player's account ID and the champion ID that player controlled to the list of matches that can be found in tail games.
To access the historical matches of a player in the root games file: 1. Get the player's account ID and the game ID. 2. Load the root2tail file. 3. Query for the matching row on account ID and game ID. 4. The corresponding row contains a list of game IDs that can be queried against the tail games files.
Note that root2tail documents the most recent 5 matches, or the matches played within the past 5 weeks, prior to the game creation date of the corresponding "root game". It also only documents the most recent games the player played with the same champion he/she played in the "root game". An empty list means the player has not played a single match with that champion within the past 5 weeks.
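A minimal pandas sketch of the lookup procedure above. The file names (root2tail.csv, tail_games.csv), the gameId column, and the tail_game_ids column are hypothetical placeholders; only accountId is named in this description, so adjust everything to the actual files in the dataset.

import ast
import pandas as pd

root2tail = pd.read_csv("root2tail.csv")    # hypothetical file name
tail_games = pd.read_csv("tail_games.csv")  # hypothetical file name

# Steps 1-3: find the row for a given player (account ID) in a given root game.
account_id = "some-encrypted-account-id"  # placeholder
game_id = 1234567890                      # placeholder
row = root2tail[(root2tail["accountId"] == account_id) & (root2tail["gameId"] == game_id)]

# Step 4: the row holds a list of game IDs that index into the tail games file.
history_ids = ast.literal_eval(row.iloc[0]["tail_game_ids"])  # hypothetical column name
history = tail_games[tail_games["gameId"].isin(history_ids)]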
On December 5th, 2020, I fetched the list of current players in the Challenger tier, then recursively gathered the historical matches of those players to build root games; that is the data collection date.
Root2tail is self-explanatory. As for the other files, each row represents a single game. The columns are quite confusing, however, as each file is a flattened version of a JSON file with nested lists of dictionaries.
I tried to think of the simplest way to make the columns comprehensible, but looking at the original JSON file is probably the easiest way to understand the structure. Use a tool like https://jsonformatter.curiousconcept.com/ to inspect the dummy_league_match.json file.
A very simple explanation: participant.stats._ and participant.timeline._ contain pretty much all match-related statistics of a player during the game.
Also, note that the "accountId" fields use encrypted account IDs which are specific to my API key. If you want to do additional research using player account IDs, you should fetch the match file first and get your own list of player account IDs.
The following are great resources I got a lot of help from: 1. https://riot-watcher.readthedocs.io/en/latest/ 2. https://riot-api-libraries.readthedocs.io/en/latest/
These two actually explain everything you need to get started on your own project with Riot API.
The following are links to related projects that could maybe help you get ideas!
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID we have calculated the following indicators (organized in categories based on the semantics of the impact aspect they best capture):
Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)
Citation Count: The total number of citations of the product, the most well-known influence indicator.
PageRank score: An influence indicator based on the PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)
RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)
Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products (3 years) and the position of the window depends on the publication date of the product, i.e., only citations received within the first 3 years after each product's publication are counted.
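As a toy illustration of how the count-based and network-based indicators above differ, here is a small Python sketch using networkx on a hypothetical five-paper citation graph. It is not the BIP! DB implementation, only a demonstration of the two underlying ideas.

import networkx as nx

# Directed edge A -> B means "A cites B".
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p3"), ("p5", "p3"), ("p5", "p4")]
G = nx.DiGraph(citations)

# Citation Count: number of incoming citation links per product.
citation_count = dict(G.in_degree())

# PageRank-style influence: citations from influential products weigh more.
pagerank = nx.pagerank(G, alpha=0.85)

print(citation_count)
print(pagerank)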
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).
From version 5.1 onward, the impact indicators are calculated at two levels: the PID level and the OpenAIRE-id level (i.e., the level of deduplicated products; see below).
Previous versions of the dataset only provided the scores at the PID level.
From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).
Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
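A minimal sketch of how percentile-based classes like these could be assigned from a vector of scores. It illustrates the thresholds above and is not the exact BIP! DB procedure.

import numpy as np

def impact_class(scores):
    # Percentile rank of each score (0 = lowest, 100 = highest).
    rank = scores.argsort().argsort()
    pct = rank / (len(scores) - 1) * 100
    top = 100 - pct  # "in top X%" of the distribution
    return np.select(
        [top <= 0.01, top <= 0.1, top <= 1, top <= 10],
        ["C1", "C2", "C3", "C4"],
        default="C5",
    )

# Synthetic heavy-tailed scores, just to show the class counts.
scores = np.random.default_rng(0).pareto(2.0, size=1000)
classes, counts = np.unique(impact_class(scores), return_counts=True)
print(dict(zip(classes, counts)))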
Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (in error) to multiple deduplicated records in the OpenAIRE Graph.
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier
From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. We provide finally the "topic_based_impact_classes.txt" file where each line follows the format "identifier
The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI dataset, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v.10 leverage the filtering rules described here to remove PIDs with problematic metadata from the dataset.
References:
[1] L. Page, S. Brin, R. Motwani and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
[2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
[3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
[4] P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).
[5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)
[6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MA) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839
[7] K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045
Find our Academic Search Engine built on top of these data here. Further note that we also provide all calculated scores through BIP! Finder's API.
Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
More details about BIP! DB can be found in our relevant peer-reviewed publication:
Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460
We kindly request that any published research that makes use of BIP! DB cite the above article.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl
RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)
# Apply the descriptive column names from the supplied renaming dictionary.
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())
# Filter to a single year (2019).
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())
# Filter to 2019 and keep only the firm identifier and revenue (line 2110) columns.
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
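For example, a minimal polars sketch for excluding the flagged firms, assuming the outlier variable is stored as a boolean column (check the data dictionary for its actual type):

import polars as pl

RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
# Drop firms flagged during the manual review (assumes a boolean `outlier` column).
RFSD_2023_clean = RFSD_2023.filter(~pl.col('outlier'))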
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer the financials of corporate groups from this data. Gazprom, for instance, had over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most firms have filed their statements with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and how soon a new version can be made available. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
By Gary Hoover [source]
This dataset contains all the record-breaking temperatures for your favorite US cities in 2015. With this information, you can prepare for any unexpected weather that may come your way in the future, or just revel in the beauty of these high heat spells from days past! With record highs spanning from January to December, stay warm (or cool) with these handy historical temperature data points
This dataset contains the record high temperatures for various US cities (records through 2015). The dataset includes a column for each individual month, along with a column for the record high over the entire year. The data is sourced from www.weatherbase.com and can be used to analyze which cities experienced hot summers, or to compare temperature variations between different regions.
Here are some useful tips on how to work with this dataset:
- Analyze individual monthly temperatures: compare high temperatures across months and locations to identify which areas experienced particularly hot summers or colder winters.
- Compare annual versus monthly data: set the annual record high against the monthly record highs to understand temperature trends at a given location throughout the seasons of a single year, or explore how different regions vary both year-round and within any given month.
- Heatmap analysis: plot the temperature data as an interactive heatmap to pinpoint regions with unusual weather conditions or higher-than-average warmth compared with cooler areas of similar geographic size.
- Statistically model the relationships between independent variables (temperature variations by month, region/city and more) and dependent variables (e.g., tourism volumes), using regression techniques such as linear models (OLS), ARIMA models/nonlinear transformations and other methods in statistical software such as Stata or the R programming language.
- Look into climate trends over longer periods: where possible, adjust the time frames included in analyses beyond 2018 by expanding upon the monthly station observations already present within the study timeframe, taking advantage of digitally available historical temperature readings rather than relying only on printed reports.

With these helpful tips, you can get started analyzing record high temperatures for US cities during 2015 using our 'Record High Temperatures for US Cities' dataset!
- Create a heat map chart of US cities representing the highest temperature on record for each city from 2015.
- Analyze trends in monthly high temperatures in order to predict future climate shifts and weather patterns across different US cities.
- Track and compare monthly high temperature records for all US cities to identify regional hot spots with higher than average records and potential implications for agriculture and resource management planning
If you use this dataset in your research, please credit the original authors.
Unknown License - Please check the dataset description for more information.
File: Highest temperature on record through 2015 by US City.csv

| Column name | Description |
|:------------|:------------|
| CITY        | Name of the city. (String) |
| JAN         | Record high temperature for the month of January. (Integer) |
| FEB         | Record high temperature for the month of February. (Integer) |
| MAR         | Record high temperature for the month of March. (Integer) |
| APR         | Record high temperature for the month of April. (Integer) |
| MAY         | Record high temperature for the month of May. (Integer) |
| JUN         | Record high temperature for the month of June. (Integer) |
| JUL         | Record high temperature for the month of July. (Integer) |
| AUG         | Record high temperature for the month of August. (Integer) |
| SEP         | Record high temperature for the month of September. (Integer) |
| OCT         | Record high temperature for the month of October. (Integer) |
| ...         | ... |
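A minimal pandas/matplotlib sketch of the heatmap idea suggested above. It assumes the CSV sits in the working directory and that all non-CITY columns are numeric temperature columns (the columns beyond OCT, truncated in the table above, are assumed to follow the same pattern).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Highest temperature on record through 2015 by US City.csv")
temp_cols = [c for c in df.columns if c != "CITY"]  # monthly columns plus any yearly record column

# Rows = cities, columns = months; colour encodes the record high temperature.
plt.figure(figsize=(8, 12))
plt.imshow(df[temp_cols].to_numpy(), aspect="auto", cmap="hot")
plt.xticks(range(len(temp_cols)), temp_cols, rotation=45)
plt.yticks(range(len(df)), df["CITY"])
plt.colorbar(label="Record high temperature")
plt.tight_layout()
plt.show()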
License: https://www.bco-dmo.org/dataset/2987/license
Confirmed right whale identifications in Cape Cod Bay and adjacent waters and sighting histories, from the R/V Shearwater NEC-MB2002-1 (1998-2002), and historical records from 1980 (Northeast Consortium Cooperative Research project).
access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv
acquisition_description=Photographic Methods
i) Identification Photographs:
During aerial and shipboard surveys, photographs were taken on Kodak Kodachrome 200ASA color slide film, using hand-held 35-mm cameras equipped with 300-mm telephoto lenses and motor drives. From the air, photographers attempted to obtain good perpendicular photographs of the entire rostral callosity pattern and back of every right whale encountered, as well as any other scars or markings. From the boat, photographers attempted to collect good oblique photographs of both sides of the head and chin, the body and the flukes. The data recorder on both platforms was responsible for keeping a written record of the roll and frame numbers shot by each photographer in the daily log.
ii) Photo-analysis and Matching:
Photographs of right whale callosity patterns are used as a basis for identification and cataloging of individuals, following methods developed by Payne et al (1983) and Kraus et al (1986). The cataloging of individually identified animals is based on using high quality photographs of distinctive callosity patterns (raised patches of roughened skin on the top and sides of the head), ventral pigmentation, lip ridges, and scars (Kraus et al 1986). New England Aquarium (NEAq) has curated the catalogue since 1980 and, to the best of their knowledge, all photographs of right whales taken in the North Atlantic since 1935 have been included in NEAq's files. This catalogue allows scientists to enumerate the population and, from resightings of known individuals, to monitor the animals' reproductive status, births, deaths, scarring, distribution and migrations. Since 1980, a total of 26,275 sightings of 436 individual right whales have been archived, of which 327 are thought to be alive as of December 2001 (A. Knowlton, NEAq, pers. comm.).
The matching process consists of separating photographs of right whales into individuals and inter-matching between days within the season. To match different sightings of the same whale, composite drawings and photographs of the callosity patterns of individual right whales are compared to a limited subset of the catalogue that includes animals with a similar appearance. For whales that look alike in the first sort, the original photographs of all probable matches are examined for callosity similarities and supplementary features, including scars, pigmentation, lip crenulations, and morphometric ratios. A match between different sightings is considered positive when the callosity pattern and at least one other feature can be independently matched by at least two experienced researchers (Kraus et al 1986). Exceptions to this multiple identifying feature requirement include whales that have unusual callosity patterns, large scars or birthmarks, or deformities so unique that matches from clear photographs can be based on only one feature. Preliminary photo-analysis and inter-matching was carried out at CCS, with matches confirmed using original photographs cataloged and archived at NEAq.
iii) Photographic Data Archiving
Upon completion of the matching process, all original slides were returned to CCS and incorporated into the CCS catalogue of identified right whales to update existing files, using the same numbering system as NEAq, in archival quality slide sheets. NEAq archives copies of photographs representing each sighting. Copies of photographs of individuals that are better than existing records, and photographs of newly identified whales, will be included in the NEAq master files as "type specimens" for future reference. The master files are maintained in fireproof safes at NEAq. All catalogue files are available for inspection and on-site use by contributors and collaborators.
awards_0_award_nid=55048
awards_0_award_number=unknown NEC-CoopRes NOAA
awards_0_funder_name=National Oceanic and Atmospheric Administration
awards_0_funding_acronym=NOAA
awards_0_funding_source_nid=352
cdm_data_type=Other
comment=Whale sighting history
P.I. Moira Brown
Confirmed right whale identifications in Cape Cod Bay and adjacent waters 1998-2002 and sighting histories.
(report appendix I)
Conventions=COARDS, CF-1.6, ACDD-1.3
data_source=extract_data_as_tsv version 2.3 19 Dec 2019
defaultDataQuery=&time<now
doi=10.1575/1912/bco-dmo.2987.1
infoUrl=https://www.bco-dmo.org/dataset/2987
institution=BCO-DMO
instruments_0_acronym=camera
instruments_0_dataset_instrument_description=35mm camera
instruments_0_dataset_instrument_nid=4728
instruments_0_description=All types of photographic equipment including stills, video, film and digital systems.
instruments_0_instrument_external_identifier=https://vocab.nerc.ac.uk/collection/L05/current/311/
instruments_0_instrument_name=Camera
instruments_0_instrument_nid=520
instruments_0_supplied_name=Camera
metadata_source=https://www.bco-dmo.org/api/dataset/2987
param_mapping={'2987': {}}
parameter_source=https://www.bco-dmo.org/mapserver/dataset/2987/parameters
people_0_affiliation=Massachusetts Division of Marine Fisheries
people_0_person_name=Dr Daniel McKiernan
people_0_person_nid=51014
people_0_role=Principal Investigator
people_0_role_type=originator
people_1_affiliation=Provincetown Center for Coastal Studies
people_1_affiliation_acronym=PCCS
people_1_person_name=Dr Moira Brown
people_1_person_nid=51013
people_1_role=Co-Principal Investigator
people_1_role_type=originator
people_2_affiliation=Provincetown Center for Coastal Studies
people_2_affiliation_acronym=PCCS
people_2_person_name=Dr Charles Mayo
people_2_person_nid=51015
people_2_role=Co-Principal Investigator
people_2_role_type=originator
people_3_affiliation=Woods Hole Oceanographic Institution
people_3_affiliation_acronym=WHOI BCO-DMO
people_3_person_name=Nancy Copley
people_3_person_nid=50396
people_3_role=BCO-DMO Data Manager
people_3_role_type=related
project=NEC-CoopRes
projects_0_acronym=NEC-CoopRes
projects_0_description=The Northeast Consortium encourages and funds cooperative research and monitoring projects in the Gulf of Maine and Georges Bank that have effective, equal partnerships among fishermen, scientists, educators, and marine resource managers.
The Northeast Consortium seeks to fund projects that will be conducted in a responsible manner. Cooperative research projects are designed to minimize any negative impacts to ecosystems or marine organisms, and be consistent with accepted ethical research practices, including the use of animals and human subjects in research, scrutiny of research protocols by an institutional board of review, etc.
projects_0_geolocation=Georges Bank, Gulf of Maine
projects_0_name=Northeast Consortium: Cooperative Research
projects_0_project_nid=2045
projects_0_project_website=http://northeastconsortium.org/
projects_0_start_date=1999-01
sourceUrl=(local files)
standard_name_vocabulary=CF Standard Name Table v55
version=1
xml_source=osprey2erddap.update_xml() v1.3
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1_yearly_detailed_income_statements
out_ferc1_yearly_detailed_balance_sheet_assets
out_ferc1_yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k_parents_and_subsidiaries
core_sec10k_quarterly_filings
core_sec10k_quarterly_exhibit_21_company_ownership
core_sec10k_quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
EIA 860M
Added EIA 860m through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.
Bug Fixes
Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing it #3860.
Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/ (see the access sketch after this list)
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
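As a quick illustration of reading the Parquet outputs directly from the public S3 bucket, here is a minimal pandas sketch. The object layout under the versioned prefix and the specific file name used here (out_ferc1_yearly_detailed_income_statements.parquet, one of the tables listed above) are assumptions, so list the bucket to confirm the actual paths.

import pandas as pd

# Anonymous access to the public bucket (requires the s3fs and pyarrow packages).
# The path layout below is an assumption; list the bucket to confirm it.
url = "s3://pudl.catalyst.coop/v2025.2.0/out_ferc1_yearly_detailed_income_statements.parquet"
df = pd.read_parquet(url, storage_options={"anon": True})
print(df.head())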
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL Github issue tracker to let us know about any bugs or data issues you encounter
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.
Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007.
_NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0
acknowledgement=Data collection funded by various oil industry operators
cdm_data_type=TrajectoryProfile
cdm_profile_variables=time
cdm_trajectory_variables=trajectory
CODAS_processing_note=
The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.
CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.
Whenever single-ping data have been recorded, full CODAS processing provides the best end product.
Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.
Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".
Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.
CODAS processing uses heading from a reliable device, and (if available) applies a time-dependent correction derived from an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:
X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
G  = reliable heading (gyrocompass)
A  = accurate heading
dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)
Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.
theta = X + (G - dh) = X + G - dh
Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply, for example if mean and median of the phase are all 0.5 (then R=0.5). Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".
new_dh = dh + R
Therefore the total angle used in rotation is
new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)
The new estimate of the transducer angle is: X - R
ANCIL2_watrk_hd_misalign contains: dh + R
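A small numpy sketch of the rotation formula above (an illustration only, not the pycurrents/CODAS implementation; the velocity and angle values are made-up examples):

import numpy as np

# Measured velocity components in ship coordinates (example values, m/s).
u_ship, v_ship = 0.30, -0.10

# Angles in degrees, positive clockwise, following the conventions above (example values).
X = 45.0    # transducer angle (CONFIG1_heading_bias)
G = 180.0   # reliable (gyro) heading
dh = 1.5    # time-dependent heading correction (G - A)

theta = np.deg2rad(X + G - dh)  # total rotation angle
rotated = (u_ship + 1j * v_ship) * np.exp(1j * theta)
u_earth, v_earth = rotated.real, rotated.imag
print(u_earth, v_earth)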
====================================================
Profile editing flags are provided for each depth cell:
binary    decimal   below     Percent
value     value     bottom    Good      bin
-------+----------+---------+---------+-------+
 000       0
 001       1                             bad
 010       2                   bad
 011       3                   bad       bad
 100       4          bad
 101       5          bad                bad
 110       6          bad      bad
 111       7          bad      bad       bad
-------+----------+---------+---------+-------+
CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.
=============   =================================================================
time            Time at the end of the ensemble, days from start of year.
lon, lat        Longitude, Latitude from GPS at the end of the ensemble.
u, v            Ocean zonal and meridional velocity component profiles.
uship, vship    Zonal and meridional velocity components of the ship.
heading         Mean ship heading during the ensemble.
depth           Bin centers in nominal meters (no sound speed profile correction).
tr_temp         ADCP transducer temperature.
pg              Percent Good pings for u, v averaging after editing.
pflag           Profile Flags based on editing, used to mask u, v.
amp             Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
=============   =================================================================
contributor_name=RPS
contributor_role=editor
contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/
Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS
cruise_id=Fugro_wh75
description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75
Easternmost_Easting=-89.72401111111111
featureType=TrajectoryProfile
geospatial_bounds=LINESTRING (-90.02310555555556 27.051475, -89.72401111111111 27.237783333333333)
geospatial_bounds_crs=EPSG:4326
geospatial_bounds_vertical_crs=EPSG:5703
geospatial_lat_max=27.237783333333333
geospatial_lat_min=27.051475
geospatial_lat_units=degrees_north
geospatial_lon_max=-89.72401111111111
geospatial_lon_min=-90.02310555555556
geospatial_lon_units=degrees_east
geospatial_vertical_max=651.63
geospatial_vertical_min=27.63
geospatial_vertical_positive=down
geospatial_vertical_units=m
hg_changeset=2924:48293b7d29a9
history=Created: 2019-07-15 17:47:26 UTC
id=C16185_075_Line1040_0
infoUrl=ADD ME
institution=GCOOS
instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler
keywords_vocabulary=GCMD Science Keywords
naming_authority=edu.tamucc.gulfhub
Northernmost_Northing=27.237783333333333
platform=ship
platform_vocabulary=https://mmisw.org/ont/ioos/platform
processing_level=QA'ed and checked by Oceanographer
program=Oil and Gas Loop Current VMADCP Program
project=O&G LC VMADCP Program
software=pycurrents
sonar=wh75
source=Current profiler
sourceUrl=(local files)
Southernmost_Northing=27.051475
standard_name_vocabulary=CF Standard Name Table v67
subsetVariables=time, longitude, latitude, depth, u, v
time_coverage_duration=P0Y0M0DT4H14M9S
time_coverage_end=2006-05-21T12:09:36Z
time_coverage_resolution=P0Y0M0DT0H5M0S
time_coverage_start=2006-05-21T07:55:27Z
Westernmost_Easting=-90.02310555555556
yearbase=2006
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Aims: The rapid increase in the number of species that have naturalized beyond their native range is among the most apparent features of the Anthropocene. How alien species will respond to other processes of future global change is an emerging concern and remains poorly understood. We therefore ask whether naturalized species will respond to climate and land-use change differently from species not yet naturalized anywhere in the world.
Location: Global
Methods: We investigated future changes in the potential alien range of vascular plant species endemic to Europe that are either naturalized (n = 272) or not yet naturalized (n = 1,213) outside of Europe. Potential ranges were estimated based on projections of species distribution models using 20 future climate-change scenarios. We mapped current and future global centres of naturalization risk. We also analyzed expected changes in the latitudinal, elevational and areal extent of species’ potential alien ranges.
Results: We showed a large potential for more worldwide naturalizations of European plants currently and in the future. The centres of naturalization risk for naturalized and non-naturalized plants largely overlapped, and their location did not change much under projected future climates. Nevertheless, naturalized plants had their potential range shifting poleward over larger distances, whereas the non-naturalized ones had their potential elevational ranges shifting further upslope under the most severe climate change scenarios. As a result, climate and land-use changes are predicted to shrink the potential alien range of European plants, but less so for already naturalized than for non-naturalized species.
Main conclusions: While currently non-naturalized plants originate frequently from mountain ranges or boreal and Mediterranean biomes in Europe, the naturalized ones usually occur at low elevations, close to human centres of activities. As the latter are expected to increase worldwide, this could explain why the potential alien range of already naturalized plants will shrink less.
Methods

Modelling the potential alien ranges of plant species under current climatic and land-use conditions
Species selection
We focused exclusively on vascular plant species endemic to Europe. Here, ‘Europe’ is used in a geographical sense and defined as bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west (the Macaronesian archipelagos were excluded), the Ural Mountains and the Caspian Sea to the east, and the Lesser Caucasus and the Mediterranean Sea to the south (Mediterranean islands included, Anatolia excluded).
The most recent version of the database ‘Endemic vascular plants in Europe’ (EvaplantE; Hobohm, 2014), containing > 6,200 endemic plant taxa, was used here as a baseline for species selection. Scientific names were standardized based on The Plant List (http://www.theplantlist.org/). This taxonomic standardization was done with the R package ‘Taxonstand’ (Cayuela et al., 2017). Infraspecific taxa were excluded from the list, resulting in 4,985 species.
Compilation of occurrence records
To comprehensively compile the distribution of our studied set of endemic species in their native continent, we combined occurrence data in Europe from five sources. The first source was the ‘Global Biodiversity Information Facility’ (GBIF), one of the largest and most widely used biodiversity databases (https://www.gbif.org/). Currently, GBIF provides access to more than 600,000 distributional records for European endemic plant species. Records of European endemic plants deemed erroneous were discarded. All occurrences from GBIF were downloaded using the R package ‘rgbif’ (Chamberlain et al., 2019). The second source was the ‘EU-Forest’ dataset, providing information on European tree species distribution, including more than half a million occurrences at a 1 km (~ 50 arcsec at 50° latitude) resolution (Mauri et al., 2017). The third source we used was the ‘European Vegetation Archive’ (EVA), which assembles observations from more than one million vegetation plots across Europe (Chytrý et al., 2016). The fourth source was the digital version of the Atlas Florae Europaeae offering gridded maps. The fifth source was the ‘Plant Functional Diversity of Grasslands’ network (DIVGRASS), combining data on plant diversity across ~ 70,000 vegetation plots in French permanent grasslands (Violle et al., 2015).
When several occurrences from these different sources were duplicated on the same 0.42° × 0.42° grid cell, only one record was kept to avoid pseudoreplication. After removing duplicate records, species with fewer than 10 occurrences were not further considered since the resulting SDM might be insufficiently accurate. The final dataset comprised 104,313 occurrences for 1,485 European endemic species.
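A minimal pandas sketch of this thinning step (an illustration under assumed file and column names species, lon and lat; not the authors' code):

import numpy as np
import pandas as pd

CELL = 0.42  # grid resolution in degrees

occ = pd.read_csv("occurrences.csv")  # hypothetical file with species, lon, lat columns

# Snap coordinates to the 0.42-degree grid and keep one record per species and cell.
occ["cell_x"] = np.floor(occ["lon"] / CELL).astype(int)
occ["cell_y"] = np.floor(occ["lat"] / CELL).astype(int)
thinned = occ.drop_duplicates(subset=["species", "cell_x", "cell_y"])

# Drop species with fewer than 10 retained occurrences.
counts = thinned["species"].value_counts()
thinned = thinned[thinned["species"].isin(counts[counts >= 10].index)]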
Environmental variables
We selected six environmental predictors related to climate, soil physico-chemical properties and land use, commonly considered to shape the spatial distribution of plants (Gurevitch et al., 2006). Annual mean temperature (°C), annual sum of precipitation (mm) and precipitation seasonality for the period 1979-2013 were extracted from the CHELSA climatologies at a 30 arcsec resolution (Karger et al., 2017). Organic carbon content (g per kg) and soil pH in the first 15 cm of topsoil were extracted at a 1 km resolution from the global gridded soil information database SoilGrids (Hengl et al., 2014). The proportion of primary land cover (land with natural vegetation that has not been subject to human activity since 1500), averaged over the period 1979-2013 in each 0.5° resolution grid cell (variable ‘gothr’) of the Harmonized Global Land Use dataset, was also used (Chini et al., 2014). Environmental variables were aggregated to a spatial resolution of 0.42° × 0.42° to match the cell size of the occurrence records with the coarsest resolution (i.e. the Atlas Florae Europaeae).
Species distribution modelling
The potential distribution of the 1,485 European endemic plant species was predicted by estimating environmental similarity to the sites of occurrence in Europe. To increase the robustness of the predictions, we used six methods to generate species distribution models (SDMs): generalized additive models; generalized linear models; generalized boosting trees; maximum entropy; multivariate adaptive regression splines; and random forests. We evaluated the predictive performance of each SDM using a repeated split-sampling approach in which SDMs were calibrated on 75% of the data and evaluated on the remaining 25%. This procedure was repeated 10 times. The evaluation was performed by measuring the area under the receiver operating characteristic (ROC) curve (AUC) and the true skill statistic (TSS). Continuous model predictions were transformed into binary ones by selecting the threshold maximizing TSS, which accounts for both sensitivity and specificity.
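As an illustration of this threshold selection, the sketch below computes TSS (sensitivity + specificity - 1) over a grid of candidate thresholds and returns the one that maximizes it. This is a generic re-implementation of the idea, not the biomod2 code used in the study.

```python
import numpy as np

def tss(obs, pred_bin):
    """True skill statistic = sensitivity + specificity - 1."""
    obs, pred_bin = np.asarray(obs, bool), np.asarray(pred_bin, bool)
    sens = (pred_bin & obs).sum() / max(obs.sum(), 1)
    spec = (~pred_bin & ~obs).sum() / max((~obs).sum(), 1)
    return sens + spec - 1

def tss_maximizing_threshold(obs, prob, grid=np.linspace(0.01, 0.99, 99)):
    """Return the probability threshold that maximizes TSS on evaluation data."""
    scores = [tss(obs, np.asarray(prob) >= t) for t in grid]
    return grid[int(np.argmax(scores))]
```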
Results of the different SDM methods were aggregated into a single consensus projection (i.e. map) to reduce uncertainties associated with each technique. To ensure the quality of the ensemble SDMs, we only kept the projections for which the accuracy estimated by AUC and TSS were higher than 0.8 and 0.6, respectively, and assembled the selected SDMs using a committee-average approach with a weight proportional to their TSS evaluation. The entire species distribution modelling process was performed within the ‘biomod2’ R platform (Thuiller et al., 2009).
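The committee-average step can be sketched as follows: models failing the AUC/TSS filter are dropped, and the remaining binary predictions are averaged with weights proportional to each model's TSS. Again, this is only an illustration of the logic, not the biomod2 implementation.

```python
import numpy as np

def committee_average(binary_preds, tss_scores, auc_scores,
                      min_auc=0.8, min_tss=0.6):
    """Weighted committee average of binary SDM predictions.

    binary_preds: array of shape (n_models, n_cells) with 0/1 predictions
    tss_scores, auc_scores: per-model evaluation scores
    """
    binary_preds = np.asarray(binary_preds, float)
    tss_scores = np.asarray(tss_scores, float)
    auc_scores = np.asarray(auc_scores, float)
    keep = (auc_scores > min_auc) & (tss_scores > min_tss)
    if not keep.any():
        raise ValueError("no model passes the quality filter")
    weights = tss_scores[keep] / tss_scores[keep].sum()
    return (weights[:, None] * binary_preds[keep]).sum(axis=0)  # values in [0, 1]
```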
Modelling the potential alien ranges of plant species under future climatic conditions
To model the potential spread of the European endemic flora outside of Europe in the future (2061-2080), we used climate and land-cover projections for the four representative concentration pathways (RCPs). Because different general circulation models (GCMs) predict substantially different climates, with concomitant differences in projected species ranges, simulations of future climate variables were based on five GCMs: CCSM4, CESM1-CAM5, CSIRO-mk3-6-0, IPSL-CM5A-LR and MIROC5.
References
Cayuela, L., Stein, A., & Oksanen, J. (2017). Taxonstand: taxonomic standardization of plant species names v.2.1. R Foundation for Statistical Computing. Available at https://cran.r-project.org/web/packages/Taxonstand/index.html.
Chamberlain, S., Barve, V., Desmet, P., Geffert, L., Mcglinn, D., Oldoni, D., & Ram, K. (2019). rgbif: interface to the Global 'Biodiversity' Information Facility API v.1.3.0. R Foundation for Statistical Computing. Available at https://cran.r-project.org/web/packages/rgbif/index.html.
Chini, L.P., Hurtt, G.C., & Frolking, S. (2014). Harmonized Global Land Use for Years 1500 – 2100, V1. Data set. Oak Ridge National Laboratory Distributed Active Archive Center, USA. Available at http://daac.ornl.gov
Chytrý, M., Hennekens, S. M., Jiménez-Alfaro, B., Knollová, I., Dengler, J., Jansen, F., … Yamalov, S. (2016). European Vegetation Archive (EVA): an integrated database of European vegetation plots. Applied Vegetation Science, 19, 173–180.
Hengl, T., de Jesus, J. M., MacMillan, R. A., Batjes, N. H., Heuvelink, G. B. M., Ribeiro, E., … Gonzalez, M. R. (2014). SoilGrids1km — Global Soil Information Based on Automated Mapping. PLoS ONE, 9, e105992.
Hobohm, C. (Ed.) (2014). Endemism in Vascular Plants. [Plant and Vegetation 9]. Dordrecht, The Netherlands: Springer.
Karger, D.N., Conrad, O., Böhner, J., Kawohl, T., Kreft, H., Soria-Auza, R.W., … Kessler, M. (2017). Climatologies at high resolution for the earth’s land surface areas. Scientific Data, 4, 170122.
Mauri, A., Strona, G., & San-Miguel-Ayanz, J. (2017). EU-Forest, a high-resolution tree occurrence dataset for Europe. Scientific Data, 4.
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata from iNaturalist into distinct phenophases and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead to the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited the ability of both human and machine annotators to accurately classify phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists of the best practices for creating images that facilitate phenological analysis.
Methods Creating a training and validation image set
We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered “research-grade” if the observation is verifiable (includes an image), includes the date and location observed, is of a plant growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images from the years 1995-2016 and randomly selected images uploaded between 2017-2020. We restricted the number of randomly selected images in 2020 by capping it at approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).
One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there were one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).
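The 'latest phenophase wins' rule can be written compactly. The function below is only an illustration of the labeling logic, with hypothetical boolean inputs describing what is visible in an image.

```python
def four_stage_label(has_bud: bool, has_open_flower: bool, has_fruit: bool) -> str:
    """Return the four-stage label, keeping only the latest phenophase present."""
    if has_fruit:
        return "fruiting"
    if has_open_flower:
        return "flowering"
    if has_bud:
        return "budding"
    return "vegetative"

def two_stage_label(has_open_flower: bool) -> str:
    """Two-stage scheme: flowering vs. non-flowering."""
    return "flowering" if has_open_flower else "non-flowering"
```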
For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants whose reproductive parts were clearly visible and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of images from the four-stage dataset, leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).
Training a two-stage and four-stage CNN
We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained on a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the PyTorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it has fewer convolutional layers and thus is less computationally intensive than pre-trained neural networks with more layers, and in early testing we reached the desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 training images (Deng et al. 2009). We used default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross-entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.
Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with a fully-connected layer with an output size of 2 for the 2-class problem and 4 for the 4-class problem. We resized and cropped the images to fit ResNet’s input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss value that quantifies the model’s inaccuracy. The gradient of the loss with respect to the model parameters is computed, and the parameters are then updated to reduce the loss. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so the validation data remains ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes too specialized to the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
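A condensed PyTorch sketch of this setup is shown below: ResNet-18 pretrained on ImageNet, a replaced final layer, 224x224 inputs, SGD with a learning rate of 0.001 and cross-entropy loss. The normalization constants are the common ImageNet channel statistics and the data loader is assumed to exist; this is a sketch of the approach, not the authors' exact script.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 4  # 2 for the two-stage scheme

# Resize/crop to ResNet's 224x224 input and normalize the RGB channels.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(pretrained=True)                 # transfer learning
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the final layer

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

def train_one_epoch(model, loader, device="cpu"):
    """One pass over the training data: predict, compute the loss, update weights."""
    model.train().to(device)
    for images, labels in loader:               # batches of size 4
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                         # gradient of the loss
        optimizer.step()                        # update the parameters
```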
We evaluated model accuracy and created confusion matrices using the model’s predictions on the labeled validation data. This allowed us to evaluate the model’s overall accuracy and to identify which specific categories were most difficult for the model to distinguish. To make phenology predictions on the full, 40,761-image dataset, we created a custom PyTorch Dataset, which loads images listed in a CSV file, passes them through the model, and keeps each prediction associated with its unique image ID.
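Such a Dataset might look like the sketch below: it reads image paths and IDs from a CSV and returns (image tensor, image ID) pairs so predictions can be matched back to iNaturalist records. The column names and the transform are placeholders, not the ones used in the study.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CsvImageDataset(Dataset):
    """Loads images listed in a CSV with (hypothetical) columns 'image_path' and 'image_id'."""

    def __init__(self, csv_path, transform):
        self.table = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.table)

    def __getitem__(self, idx):
        row = self.table.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        return self.transform(image), row["image_id"]

# Usage sketch: iterate in batches and record the predicted phenophase per image ID.
# loader = DataLoader(CsvImageDataset("images.csv", preprocess), batch_size=4)
# with torch.no_grad():
#     for images, ids in loader:
#         preds = model(images).argmax(dim=1)
```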
Hardware information
Model training was conducted using a personal laptop (Ryzen 5 3500U CPU and 8 GB of memory) and a desktop computer (Ryzen 5 3600 CPU, NVIDIA RTX 3070 GPU and 16 GB of memory).
Comparing CNN accuracy to human annotation accuracy
We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761-image dataset. An expert annotator (R. Reeb, who has over a year’s experience annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using Student's t-tests.
By Reddit [source]
This dataset provides an in-depth exploration of the world of online dating, based on data mined from Reddit's Tinder subreddit. Through analysis of the six columns titled title, score, id, url, comms_num and created (which include information such as social norms and user behaviors related to online dating), this dataset offers valuable insights into how people are engaging with digital media and their attitudes towards it. Unveiling potential dangers such as safety risks and scams that can arise from online dating activities is also possible with this data. Its findings are relevant for anyone interested in understanding how relationships develop on a digital platform, both for researchers uncovering the sociotechnical aspects of online dating behavior and for companies seeking further insight into their users' perspectives. All in all, this dataset might just hold the missing pieces to understanding our current relationship dynamics!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset provides a comprehensive overview of online dating trends and behaviors observed on Reddit's Tinder subreddit. This data can be used to analyze user opinions, investigate user experiences, and discover online dating trends. To utilize this dataset effectively, there are several steps an individual can take to gain insights from the data:
- Using the dataset to examine how online dating trends vary geographically and by demographics (gender, age, race etc.)
- Analyzing the language used in posts for insights into user attitudes towards online dating.
- Creating a machine learning model to predict a post's score based on its title, body and other features of the dataset; this can help digital media companies better target their marketing efforts towards more successful posts on Tinder subreddits (a minimal baseline is sketched below).
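A minimal baseline for that last idea, assuming the title, body and score columns described in the file listing below, could look like this sketch; it is only a starting point, not a tuned model.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

posts = pd.read_csv("Tinder.csv")
text = posts["title"].fillna("") + " " + posts["body"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    text, posts["score"], test_size=0.2, random_state=0)

# TF-IDF features over title + body, with a simple ridge regression on top.
model = make_pipeline(TfidfVectorizer(max_features=20_000), Ridge())
model.fit(X_train, y_train)
print("R^2 on held-out posts:", model.score(X_test, y_test))
```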
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Tinder.csv | Column name | Description | |:--------------|:--------------------------------------------------------| | title | The title of the post. (String) | | score | The number of upvotes the post has received. (Integer) | | url | The URL of the post. (String) | | comms_num | The number of comments the post has received. (Integer) | | created | The date and time the post was created. (DateTime) | | body | The body of the post. (String) | | timestamp | The timestamp of the post. (Integer) |
If you use this dataset in your research, please credit Reddit.
Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=
The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.
CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.
Whenever single-ping data have been recorded, full CODAS processing provides the best end product.
Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.
Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".
Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.
CODAS processing uses heading from a reliable device, and (if available) applies a time-dependent correction from an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:
X = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
G = reliable heading (gyrocompass)
A = accurate heading
dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)
Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.
theta = X + (G - dh) = X + G - dh
Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply; for example, if the mean and median of the phase are both 0.5, then R = 0.5. Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".
new_dh = dh + R
Therefore the total angle used in rotation is
new_theta = X + G - new_dh = X + G - (dh + R) = (X - R) + (G - dh)
The new estimate of the transducer angle is: X - R
ANCIL2_watrk_hd_misalign contains: dh + R
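The rotation and the calibration update can be written directly from these formulas; the NumPy sketch below (angles in degrees) is an illustration of the arithmetic, not part of the CODAS software.

```python
import numpy as np

def rotate_to_earth(u_ship, v_ship, X, G, dh):
    """Rotate ship-frame velocities into earth coordinates.

    theta = X + (G - dh): transducer angle plus the heading corrected by
    the time-dependent misalignment dh = G - A.
    """
    theta = np.deg2rad(X + G - dh)
    rotated = (np.asarray(u_ship) + 1j * np.asarray(v_ship)) * np.exp(1j * theta)
    return rotated.real, rotated.imag

def apply_calibration(dh, X, R):
    """Fold a watertrack/bottomtrack phase offset R into the stored angles."""
    new_dh = dh + R   # stored in ANCIL2_watrk_hd_misalign
    new_X = X - R     # new estimate of the transducer angle
    return new_dh, new_X
```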
====================================================
Profile editing flags are provided for each depth cell:
| binary value | decimal value | below bottom | Percent Good | bin |
|:------------:|:-------------:|:------------:|:------------:|:---:|
| 000 | 0 | | | |
| 001 | 1 | | | bad |
| 010 | 2 | | bad | |
| 011 | 3 | | bad | bad |
| 100 | 4 | bad | | |
| 101 | 5 | bad | | bad |
| 110 | 6 | bad | bad | |
| 111 | 7 | bad | bad | bad |
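Reading pflag is a matter of testing the three bits; a small Python sketch (not part of the CODAS tools) of that decoding:

```python
def decode_pflag(value: int) -> dict:
    """Decode a CODAS profile editing flag into its three criteria."""
    return {
        "below_bottom": bool(value & 0b100),  # decimal 4
        "percent_good": bool(value & 0b010),  # decimal 2
        "bin_edited":   bool(value & 0b001),  # decimal 1
    }

# decode_pflag(6) -> {'below_bottom': True, 'percent_good': True, 'bin_edited': False}
```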
CODAS_variables= Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.
| Variable | Description |
|:---------|:------------|
| time | Time at the end of the ensemble, days from start of year. |
| lon, lat | Longitude, Latitude from GPS at the end of the ensemble. |
| u, v | Ocean zonal and meridional velocity component profiles. |
| uship, vship | Zonal and meridional velocity components of the ship. |
| heading | Mean ship heading during the ensemble. |
| depth | Bin centers in nominal meters (no sound speed profile correction). |
| tr_temp | ADCP transducer temperature. |
| pg | Percent Good pings for u, v averaging after editing. |
| pflag | Profile flags based on editing, used to mask u, v. |
| amp | Received signal strength in ADCP-specific units; no correction for spreading or attenuation. |
contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh75 description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75 Easternmost_Easting=-91.18131944444445 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-91.47516388888891 26.999433333333332, -91.18131944444445 27.000597222222222) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.000597222222222 geospatial_lat_min=26.999433333333332 geospatial_lat_units=degrees_north geospatial_lon_max=-91.18131944444445 geospatial_lon_min=-91.47516388888891 geospatial_lon_units=degrees_east geospatial_vertical_max=600.76 geospatial_vertical_min=16.76 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:48:18 UTC id=C16185_075_Line1388_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.000597222222222 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh75 source=Current profiler sourceUrl=(local files) Southernmost_Northing=26.999433333333332 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT3H34M46S time_coverage_end=2006-08-22T01:10:58Z time_coverage_resolution=P0Y0M0DT0H4M59S time_coverage_start=2006-08-21T21:36:12Z Westernmost_Easting=-91.47516388888891 yearbase=2006
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
See here for more information.
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:
- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: Digital Object Identifier (see https://www.doi.org), when available
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).
You can use, for example, gsutil to download the data to your local machine:
```
gsutil cp gs://arxiv-dataset/arxiv/
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/   # copy the PDFs under one folder
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/               # copy the full PDF tree (large, see above)
```
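The metadata file contains one JSON record per paper; assuming the common JSON-lines layout (one record per line; the filename below is a placeholder for your local copy), it can be streamed without loading everything into memory:

```python
import json

def iter_papers(path="arxiv-metadata.json"):
    """Yield one metadata record (a dict) per line of the JSON file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Example: collect abstract-page links for papers tagged with a given category.
links = [f"https://arxiv.org/abs/{paper['id']}"
         for paper in iter_papers()
         if "cs.LG" in paper.get("categories", "")]
```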
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note= (identical to the CODAS processing note and variable descriptions given for the first ADCP entry above)
contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh300 description=Shipboard ADCP velocity profiles from Fugro_wh300 using instrument wh300 Easternmost_Easting=-91.18141111111112 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-91.47513888888886 26.999433333333332, -91.18141111111112 27.0006) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.0006 geospatial_lat_min=26.999433333333332 geospatial_lat_units=degrees_north geospatial_lon_max=-91.18141111111112 geospatial_lon_min=-91.47513888888886 geospatial_lon_units=degrees_east geospatial_vertical_max=123.42 geospatial_vertical_min=7.42 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:45:44 UTC id=C16185_300_Line1388_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.0006 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh300 source=Current profiler sourceUrl=(local files) Southernmost_Northing=26.999433333333332 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT3H34M42S time_coverage_end=2006-08-22T01:10:57Z time_coverage_resolution=P0Y0M0DT0H5M0S time_coverage_start=2006-08-21T21:36:15Z Westernmost_Easting=-91.47513888888886 yearbase=2006
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A beginner-friendly version of the MIT-BIH Arrhythmia Database, which contains 48 electrocardiograms (EKGs) from 47 patients who were at Beth Israel Deaconess Medical Center in Boston, MA in 1975-1979.
This data was updated to a new format on 7/18/2025 with new filenames. Now heartbeats are labeled and their annotations are in new CSV and JSON files. This means that each patient's EKG file is now named {id}_ekg.csv and they have accompanying heartbeat annotation files, named {id}_annotations.csv. For example, if your code used to open 100.csv, it should be changed to opening 100_ekg.csv.
Each of the 48 EKGs has the following files (using patient 100 as an example):
- 100_ekg.csv - a 30-minute EKG recording from one patient with 2 EKG channels. This also contains annotations (the symbol column), where doctors have marked and classified heartbeats as normal or abnormal.
- 100_ekg.json - the 30-minute EKG with all of its metadata. It has all of the same data as the CSV file in addition to frequency/sample rate info and more.
- 100_annotations.csv - the labels for the heartbeats, where doctors have manually classified each heartbeat as normal or as one of dozens of types of arrhythmia. There may be multiple of these files (numbered 1, 2, or 3), since the original MIT-BIH Arrhythmia Database had multiple .atr files for some patients. The MIT-BIH DB did not elaborate on why, though the differences between the annotation files seem to be only a few lines at most.
- 100_annotations.json - the annotation file that is as close to the original as possible, keeping all of its metadata, while being an easy-to-use JSON file (as opposed to an .atr file, which requires the WFDB library to open).
Other files:
- annotation_symbols.csv - contains the meanings of the annotation symbols
There are 48 EKGs for 47 patients, each of which is a 30-minute electrocardiogram (EKG) from a single patient (records 201 and 202 are from the same patient). Data was collected at 360 Hz, meaning that 360 data points correspond to 1 second of time.
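A small pandas sketch of loading one record and converting sample indices to seconds using the 360 Hz sampling rate; the exact column names are an assumption, so check the headers of your copy.

```python
import pandas as pd

FS = 360  # samples per second

ekg = pd.read_csv("100_ekg.csv")        # 30-minute, 2-channel recording for patient 100
ekg["time_s"] = ekg.index / FS          # sample index -> seconds into the recording

# Beat labels live in the 'symbol' column (also provided in 100_annotations.csv);
# count how many beats of each type were annotated for this record.
print(ekg["symbol"].dropna().value_counts())
```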
Each file's name starts with the ID of the patient (except for 201 and 202, which are the same person).
The P-waves were labeled by doctors and technicians, and their exact indices are available in the accompanying dataset, MIT-BIH Arrhythmia Database P-wave Annotations.
EKGs, or electrocardiograms, measure the heart's function by looking at its electrical activity. The electrical activity in each part of the heart is supposed to happen in a particular order and intensity, creating that classic "heartbeat" line (or "QRS complex") you see on monitors in medical TV shows.
There are a few types of EKGs (4-lead, 5-lead, 12-lead, etc.), which give us varying detail about the heart. A 12-lead is one of the most detailed types of EKGs, as it allows us to get 12 different outputs or graphs, all looking at different, specific parts of the heart muscles.
This dataset only publishes two leads from each patient's 12-lead EKG, since that is all that the original MIT-BIH database provided.
Check out Ninja Nerd's EKG Basics tutorial on YouTube to understand what each part of the QRS complex (or heartbeat) means from an electrical standpoint.
The two leads are often lead MLII and another lead such as V1, V2, or V5, though some datasets do not use MLII at all. MLII is the lead most often associated with the classic QRS Complex (the medical name for a single heartbeat).
Info about [each of the 47 patients is available here](https://physionet.org/phys...
This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127).
The official, raw dataset is available on the General Directorate of Epidemiology website: https://www.gob.mx/salud/documentos/datos-abiertos-152127.
You might also want to download the official column descriptors and the variable definitions (e.g. SEXO=1 -> Female; SEXO=2 -> Male; SEXO=99 -> Undisclosed) in the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've maintained the original levels as described in the official dataset, unless otherwise specified.
IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but consisted of PDF files, often with many inconsistencies that had to be resolved carefully; these fixes are annotated in the .R script. More recent data should be more reliable, but early on there were many things to figure out (e.g. when the official methodology was changed to assign the region of a case based on residence rather than origin). I've added more notes on the very early data here: https://github.com/marianarf/covid19_mexico_data.
[More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).
I hope that this data serves as a base to understand the clinical symptoms 🔬 that distinguish a COVID-19-positive case from other viral respiratory diseases, and helps expand knowledge about COVID-19 worldwide.
👩🔬🧑🔬🧪 With more models tested, added features and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result in two scenarios:
The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded as 1 = POSITIVE and 2 = NEGATIVE.
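For example, the integer codes can be mapped to labels when loading the data; the mappings below are the ones cited above, and the CSV is the processed file from the repository mentioned later in this description.

```python
import pandas as pd

covid = pd.read_csv("mexico_covid19.csv")

# Decode the integer levels documented in the official data dictionary.
covid["RESULTADO_LABEL"] = covid["RESULTADO"].map({1: "POSITIVE", 2: "NEGATIVE"})
covid["SEXO_LABEL"] = covid["SEXO"].map({1: "FEMALE", 2: "MALE", 99: "UNDISCLOSED"})

print(covid["RESULTADO_LABEL"].value_counts(dropna=False))
```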
The data was gathered using a "sentinel model" that samples 10% of the patients that present a viral respiratory diagnosis to test for COVID-19, and consists of data reported by 475 viral respiratory disease monitoring units (hospitals) named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral) throughout the country in the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).
Data is first processed with [this .R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data is updated daily. Important: since the data is updated on GitHub, assume the data uploaded here isn't the latest version, and instead load the data directly from the 'csv' [in this github repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).
Additional notes on the processed data:
- The original 'ID_REGISTRO' is kept, along with a (new) unique reference 'id' used to remove duplicates.
- The region of a case was previously assigned from ENTIDAD_UM (the region of the medical unit) but now uses ENTIDAD_RES (the region of residence of the patient).
- In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from the previous day, this allows keeping track of this lag).
...
Program of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note= (identical to the CODAS processing note and variable descriptions given for the first ADCP entry above)
contributor_name=RPS contributor_role=editor contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/ Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS cruise_id=Fugro_wh75 description=Shipboard ADCP velocity profiles from Fugro_wh75 using instrument wh75 Easternmost_Easting=-89.82341944444443 featureType=TrajectoryProfile geospatial_bounds=LINESTRING (-89.94420555555558 27.253375, -89.82341944444443 27.25493888888889) geospatial_bounds_crs=EPSG:4326 geospatial_bounds_vertical_crs=EPSG:5703 geospatial_lat_max=27.25493888888889 geospatial_lat_min=27.253375 geospatial_lat_units=degrees_north geospatial_lon_max=-89.82341944444443 geospatial_lon_min=-89.94420555555558 geospatial_lon_units=degrees_east geospatial_vertical_max=651.83 geospatial_vertical_min=27.83 geospatial_vertical_positive=down geospatial_vertical_units=m hg_changeset=2924:48293b7d29a9 history=Created: 2019-07-15 17:47:36 UTC id=C16185_075_Line1054_0 infoUrl=ADD ME institution=GCOOS instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler keywords_vocabulary=GCMD Science Keywords naming_authority=edu.tamucc.gulfhub Northernmost_Northing=27.25493888888889 platform=ship platform_vocabulary=https://mmisw.org/ont/ioos/platform processing_level=QA'ed and checked by Oceanographer program=Oil and Gas Loop Current VMADCP Program project=O&G LC VMADCP Program software=pycurrents sonar=wh75 source=Current profiler sourceUrl=(local files) Southernmost_Northing=27.253375 standard_name_vocabulary=CF Standard Name Table v67 subsetVariables=time, longitude, latitude, depth, u, v time_coverage_duration=P0Y0M0DT1H16M8S time_coverage_end=2006-05-25T12:05:45Z time_coverage_resolution=P0Y0M0DT0H4M59S time_coverage_start=2006-05-25T10:49:37Z Westernmost_Easting=-89.94420555555558 yearbase=2006
Twitter"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]
Each row represents a customer; each column contains a customer's attributes, described in the column Metadata.
The data set includes information about:
To explore this type of model and learn more about the subject.
New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
TwitterProgram of vessel mount ADCP measurements comprising a combination of 300kHz and 75kHz ADCP data collected in the vicinity of the Loop Current and drilling blocks between 2004 and 2007. _NCProperties=version=2,netcdf=4.7.4,hdf5=1.12.0, acknowledgement=Data collection funded by various oil industry operators cdm_data_type=TrajectoryProfile cdm_profile_variables=time cdm_trajectory_variables=trajectory CODAS_processing_note=
The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables, during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980's.
CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.
Whenever single-ping data have been recorded, full CODAS processing provides the best end product.
Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.
Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles". The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".
Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.
CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written degrees, positive clockwise. setting up some variables:
X  = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
G  = reliable heading (gyrocompass)
A  = accurate heading
dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)
Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.
theta = X + (G - dh) = X + G - dh
Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply; for example, if the mean and median of the phase are both 0.5, then R = 0.5. Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".
new_dh = dh + R
Therefore the total angle used in rotation is
new_theta = X + G - dh_new = X + G - (dh + R) = (X - R) + (G - dh)
The new estimate of the transducer angle is: X - R
ANCIL2_watrk_hd_misalign contains: dh + R
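As an illustration only (the function and variable names are assumptions, not part of the CODAS files), the rotation and calibration update described above can be sketched in Python as:

```python
import numpy as np

def rotate_to_earth(u_meas, v_meas, X, G, dh, R=0.0):
    """Rotate measured velocity components using (u + i*v) * exp(i*theta).

    u_meas, v_meas : measured (ship-relative) velocity components
    X  : transducer angle (CONFIG1_heading_bias), degrees, positive clockwise
    G  : reliable (gyro) heading, degrees
    dh : time-dependent heading correction G - A, degrees
    R  : residual angle from watertrack/bottomtrack calibration, degrees
    """
    # Total rotation angle: new_theta = (X - R) + (G - dh), converted to radians.
    theta = np.deg2rad((X - R) + (G - dh))
    rotated = (np.asarray(u_meas) + 1j * np.asarray(v_meas)) * np.exp(1j * theta)
    return rotated.real, rotated.imag
```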
====================================================
Profile editing flags are provided for each depth cell:
 binary    decimal    below    Percent
 value      value     bottom    Good      bin
-------+----------+---------+----------+-------+
  000        0
  001        1                            bad
  010        2                   bad
  011        3                   bad      bad
  100        4         bad
  101        5         bad                bad
  110        6         bad       bad
  111        7         bad       bad      bad
-------+----------+---------+----------+-------+
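In other words, the flag is a 3-bit mask: bit 0 (value 1) marks a bad bin, bit 1 (value 2) a bad Percent Good, and bit 2 (value 4) a cell below the bottom. A small decoding sketch (the variable name pflag comes from the variable table below; the rest is illustrative):

```python
import numpy as np

def decode_pflag(pflag):
    """Split the 3-bit profile editing flag into boolean masks."""
    pflag = np.asarray(pflag, dtype=int)
    return {
        "bad_bin": (pflag & 1) != 0,          # bit 0: bad bin
        "bad_percent_good": (pflag & 2) != 0, # bit 1: bad Percent Good
        "below_bottom": (pflag & 4) != 0,     # bit 2: below bottom
        "good": pflag == 0,                   # no flags set
    }
```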
CODAS_variables= Variables in this CODAS short-form netCDF file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.
=============  =================================================================
time           Time at the end of the ensemble, days from start of year.
lon, lat       Longitude, Latitude from GPS at the end of the ensemble.
u, v           Ocean zonal and meridional velocity component profiles.
uship, vship   Zonal and meridional velocity components of the ship.
heading        Mean ship heading during the ensemble.
depth          Bin centers in nominal meters (no sound speed profile correction).
tr_temp        ADCP transducer temperature.
pg             Percent Good pings for u, v averaging after editing.
pflag          Profile Flags based on editing, used to mask u, v.
amp            Received signal strength in ADCP-specific units; no correction
               for spreading or attenuation.
=============  =================================================================
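For example, a reader might mask u and v with pflag (and optionally a Percent Good threshold) before analysis. A hedged sketch using xarray, with a hypothetical file name taken from the id attribute below and an arbitrary threshold of 50, is:

```python
import xarray as xr

# Hypothetical file name; any CODAS short-form netCDF file from this program
# should expose the variables listed in the table above.
ds = xr.open_dataset("C16185_300_Line1292_0.nc")

# Keep only velocities with no editing flags set and, optionally, with
# Percent Good above a chosen threshold (50 here is an arbitrary example,
# not a value prescribed by the dataset).
good = (ds["pflag"] == 0) & (ds["pg"] > 50)
u_clean = ds["u"].where(good)
v_clean = ds["v"].where(good)
```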
contributor_name=RPS
contributor_role=editor
contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/
Conventions=CF-1.6, ACDD-1.3, IOOS Metadata Profile Version 1.2, COARDS
cruise_id=Fugro_wh300
description=Shipboard ADCP velocity profiles from Fugro_wh300 using instrument wh300
Easternmost_Easting=-90.05254444444444
featureType=TrajectoryProfile
geospatial_bounds=LINESTRING (-90.64675833333331 26.964925, -90.05254444444444 27.216625)
geospatial_bounds_crs=EPSG:4326
geospatial_bounds_vertical_crs=EPSG:5703
geospatial_lat_max=27.216625
geospatial_lat_min=26.964925
geospatial_lat_units=degrees_north
geospatial_lon_max=-90.05254444444444
geospatial_lon_min=-90.64675833333331
geospatial_lon_units=degrees_east
geospatial_vertical_max=123.42
geospatial_vertical_min=7.42
geospatial_vertical_positive=down
geospatial_vertical_units=m
hg_changeset=2924:48293b7d29a9
history=Created: 2019-07-15 17:46:18 UTC
id=C16185_300_Line1292_0
infoUrl=ADD ME
institution=GCOOS
instrument=In Situ/Laboratory Instruments > Profilers/Sounders > Acoustic Sounders > ADCP > Acoustic Doppler Current Profiler
keywords_vocabulary=GCMD Science Keywords
naming_authority=edu.tamucc.gulfhub
Northernmost_Northing=27.216625
platform=ship
platform_vocabulary=https://mmisw.org/ont/ioos/platform
processing_level=QA'ed and checked by Oceanographer
program=Oil and Gas Loop Current VMADCP Program
project=O&G LC VMADCP Program
software=pycurrents
sonar=wh300
source=Current profiler
sourceUrl=(local files)
Southernmost_Northing=26.964925
standard_name_vocabulary=CF Standard Name Table v67
subsetVariables=time, longitude, latitude, depth, u, v
time_coverage_duration=P0Y0M0DT7H10M17S
time_coverage_end=2006-07-15T19:33:39Z
time_coverage_resolution=P0Y0M0DT0H4M59S
time_coverage_start=2006-07-15T12:23:22Z
Westernmost_Easting=-90.64675833333331
yearbase=2006
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
There is a lack of publicly available datasets on financial services, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.
We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which is the provider of the mobile financial service currently running in more than 14 countries around the world.
This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.
Here is a sample of one row; each column is explained below (a short Python loading sketch follows the column list):
1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0
step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (one month of simulation).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction.
oldbalanceOrg - initial balance before the transaction.
newbalanceOrig - new balance after the transaction.
nameDest - customer who is the recipient of the transaction.
oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).
isFraud - transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent behaviour of the agents aims to profit by taking control of customers' accounts, trying to empty the funds by transferring them to another account and then cashing out of the system.
isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
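As a minimal illustration (the file name is a placeholder and the column spellings are taken from the description above, so check them against the actual Kaggle file), the dataset could be loaded and summarised like this:

```python
import pandas as pd

# Hypothetical file name; substitute the actual CSV downloaded from Kaggle.
df = pd.read_csv("paysim.csv")

# Columns described above: step, type, amount, nameOrig, oldbalanceOrg,
# newbalanceOrig, nameDest, oldbalanceDest, newbalanceDest, isFraud, isFlaggedFraud.
print(df["type"].value_counts())             # how many of each transaction type
print(df.groupby("type")["isFraud"].mean())  # fraud rate per transaction type

# One simulated month at 1 step = 1 hour: derive a day index
# (assuming steps start at 1, as in the sample row above).
df["day"] = (df["step"] - 1) // 24
```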
There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available here: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).
We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16 GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
This work is part of the research project “Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.
Please refer to this dataset using the following citations:
PaySim first paper of the simulator:
E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus, 2016.