In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning.
In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a policy. This function is iteratively updated to maximize rewards based on the agent's task performance.[1] However, explicitly defining a reward function that accurately approximates human preferences is challenging. Therefore, RLHF seeks to train a "reward model" directly from human feedback.[2] The reward model is first trained in a supervised manner to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model then serves as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization.[3][4][5]
RLHF has applications in various domains in machine learning, including natural language processing tasks such as text summarization and conversational agents, computer vision tasks like text-to-image models, and the development of video game bots. While RLHF is an effective method of training models to act better in accordance with human preferences, it also faces challenges due to the way the human preference data is collected. Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.
Background and motivation
Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge.[6] For example, one may want to train a model to generate safe text that is both helpful and harmless (such as lacking bias, toxicity, or otherwise harmful content). Asking humans to manually create examples of harmless and harmful text would be difficult and time-consuming. However, humans are adept at swiftly assessing and comparing the harmfulness of different AI-generated text. Therefore, a more practical objective would be to allow the model to use this type of human feedback to improve its text generation.[7]
Despite the clear benefits of incorporating human feedback in training models, prior efforts—including some that leverage reinforcement learning—have encountered significant challenges. Most attempts were either narrow and difficult to generalize, breaking down on more complex tasks,[8][9][10][11] or they faced difficulties learning from sparse reward functions (which provide little specific information and cover large amounts of text at a time) or noisy reward functions (which reward similar outputs inconsistently).[12][13]
RLHF was not the first successful method of using human feedback for reinforcement learning, but it is one of the most widely used. The foundation for RLHF was introduced as an attempt to create a general algorithm for learning from a practical amount of human feedback.[6][3] The algorithm as used today was introduced by OpenAI in a paper on enhancing text continuation or summarization based on human feedback, and it began to gain popularity when the same method was reused in their paper on InstructGPT.[2][14][15] RLHF has also been shown to improve the robustness of RL agents and their capacity for exploration, which results in an optimization process more adept at handling uncertainty and efficiently exploring its environment in search of the highest reward.[16]
Collecting human feedback
Human feedback is commonly collected by prompting humans to rank instances of the agent's behavior.[15][17][18] These rankings can then be used to score outputs, for example, using the Elo rating system, which is an algorithm for calculating the relative skill levels of players in a game based only on the outcome of each game.[3] While ranking outputs is the most widely adopted form of feedback, recent research has explored other forms, such as numerical feedback, natural language feedback, and prompting for direct edits to the model's output.[19]
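As a concrete illustration of how such ratings work, the Elo update for a single pairwise comparison can be computed as in the following sketch. The function name and the K-factor of 32 are illustrative assumptions, not part of any cited RLHF system.

```python
def elo_update(rating_a: float, rating_b: float,
               a_preferred: bool, k: float = 32.0) -> tuple[float, float]:
    """Update two Elo ratings after one human comparison.

    rating_a, rating_b: current ratings of two sampled outputs.
    a_preferred: True if the annotator preferred output A over output B.
    k: update step size (32 is a common but arbitrary choice).
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_preferred else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```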
One initial motivation of RLHF was that it requires relatively small amounts of comparison data to be effective.[6] It has been shown that a small amount of data can lead to comparable results to a larger amount. In addition, increasing the amount of data tends to be less effective than proportionally increasing the size of the reward model.[14] Nevertheless, a larger and more diverse amount of data can be crucial for tasks where it is important to avoid bias from a partially representative group of annotators.[15]
When learning from human feedback through pairwise comparison under the Bradley–Terry–Luce model (or the Plackett–Luce model for K-wise comparisons over more than two options), the maximum likelihood estimator (MLE) for linear reward functions has been shown to converge if the comparison data is generated under a well-specified linear model. This implies that, under certain conditions, if a model is trained to decide which choices people would prefer between pairs (or groups) of choices, it will necessarily improve at predicting future preferences. This improvement is expected as long as the comparisons it learns from are based on a consistent and simple rule.[20][21]
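Under the Bradley–Terry–Luce model, the probability that response i is preferred over response j is the sigmoid of the score difference, so the MLE of the rewards can be obtained by gradient ascent on the comparison log-likelihood. The snippet below is a minimal illustrative sketch with hypothetical data; it is not the estimator analyzed in the cited theoretical work.

```python
import numpy as np

def bradley_terry_mle(comparisons, n_items, lr=0.1, steps=2000):
    """Fit scores r under the Bradley-Terry model by gradient ascent.

    comparisons: list of (winner_index, loser_index) pairs from annotators.
    Returns an array of scores; score differences model preference probabilities.
    """
    r = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + np.exp(-(r[winner] - r[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        r += lr * grad
        r -= r.mean()  # remove the additive degree of freedom
    return r

# Hypothetical rankings over three responses: 0 beat 1 and 2, and 1 beat 2.
scores = bradley_terry_mle([(0, 1), (0, 2), (1, 2)], n_items=3)
```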
Both offline data collection models, where the model learns by interacting with a static dataset and updating its policy in batches, and online data collection models, where the model directly interacts with the dynamic environment and updates its policy immediately, have been studied mathematically, with proven sample complexity bounds for RLHF under different feedback models.[20][22]
In the offline data collection model, when the objective is policy training, a pessimistic MLE that incorporates a lower confidence bound as the reward estimate is most effective. Moreover, when applicable, it has been shown that considering K-wise comparisons directly is asymptotically more efficient than converting them into pairwise comparisons for prediction purposes.[22][23][15]
In the online scenario, when human feedback is collected through pairwise comparisons under the Bradley–Terry–Luce model and the objective is to minimize the algorithm's regret (the difference in performance compared to an optimal agent), it has been shown that an optimistic MLE that incorporates an upper confidence bound as the reward estimate can be used to design sample efficient algorithms (meaning that they require relatively little training data). A key challenge in RLHF when learning from pairwise (or dueling) comparisons is associated with the non-Markovian nature of its optimal policies. Unlike simpler scenarios where the optimal strategy does not require memory of past actions, in RLHF, the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent.[21]
Applications
RLHF has been applied to various domains of natural language processing (NLP), such as conversational agents, text summarization, and natural language understanding.[24][14] Ordinary reinforcement learning, in which agents learn from their actions based on a predefined "reward function", is difficult to apply to NLP tasks because the rewards tend to be difficult to define or measure, especially when dealing with complex tasks that involve human values or preferences.[6] RLHF can steer NLP models, in particular language models, to provide answers that align with human preferences with regard to such tasks by capturing their preferences beforehand in the reward model. This results in a model capable of generating more relevant responses and rejecting inappropriate or irrelevant queries.[15][25] Some notable examples of RLHF-trained language models are OpenAI's ChatGPT (and its predecessor InstructGPT),[17][26][27] DeepMind's Sparrow,[28][29][30] Google's Gemini,[31] and Anthropic's Claude.[32]
In computer vision, RLHF has also been used to align text-to-image models. Studies that successfully used RLHF for this goal have noted that the use of KL regularization in RLHF, which aims to prevent the learned policy from straying too far from the unaligned model, helped to stabilize the training process by reducing overfitting to the reward model. The final image outputs from models trained with KL regularization were noted to be of significantly higher quality than those trained without.[33][34] Other methods tried to incorporate the feedback through more direct training—based on maximizing the reward without the use of reinforcement learning—but conceded that an RLHF-based approach would likely perform better due to the online sample generation used in RLHF during updates as well as the aforementioned KL regularization over the prior model, which mitigates overfitting to the reward function.[35]
RLHF was initially applied to other areas, such as the development of video game bots and tasks in simulated robotics. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences. In classical RL-based training of such bots, the reward function is simply correlated to how well the agent is performing in the game, usually using metrics like the in-game score. In comparison, in RLHF, a human is periodically presented with two clips of the agent's behavior in the game and must decide which one looks better. This approach can teach agents to perform at a competitive level without ever having access to their score. In fact, it was shown that RLHF can sometimes lead to superior performance over RL with score metrics because the human's preferences can contain more useful information than performance-based metrics.[6][36] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[37]
Training
In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy. The reward model learns to determine what behavior is desirable based on human feedback, while the policy is guided by the reward model to determine the agent's actions. Both models are commonly initialized using a pre-trained autoregressive language model. This model is then customarily trained in a supervised manner on a relatively small dataset of pairs of prompts to an assistant and their accompanying responses, written by human annotators. The reward model benefits from starting with a pre-trained model, as this initializes it with an understanding of language and focuses training explicitly on learning human preferences, speeding up the process. In addition to being used to initialize the reward model and the RL policy, the model is then also used to sample data to be compared by annotators.[15][14]
The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head. This change shifts the model from its original classification task over its vocabulary to simply outputting a number corresponding to the score of any given prompt and response. This model is trained on the human preference comparison data collected earlier from the supervised model. In particular, it is trained to minimize the following cross-entropy loss function, which incentivizes it to make predictions that are closer to the actual human ratings:
$$\mathcal{L}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

where $K$ is the number of responses the labelers ranked, $r_\theta(x, y)$ is the output of the reward model for prompt $x$ and completion $y$, $y_w$ is the preferred completion over $y_l$, $\sigma$ denotes the sigmoid function, and $\mathbb{E}$ denotes the expected value.[15] This loss function essentially measures the difference between the reward model's predictions and the decisions made by humans. The goal is to make the model's guesses as close as possible to the humans' preferences by minimizing the difference measured by this equation. In the case of only pairwise comparisons, the factor of $\binom{K}{2}$ is omitted.[14] Otherwise, all $\binom{K}{2}$ comparisons from each prompt are used for training as a single batch.[15] After training, the outputs of the model are normalized such that the reference completions have a mean score of 0.[14]
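In the pairwise case, this loss reduces to a negative log-sigmoid of the score difference between the preferred and dispreferred completions. The following PyTorch-style sketch shows that form; the function name and tensor shapes are assumptions for illustration rather than code from the cited papers.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for a reward model.

    chosen_scores, rejected_scores: shape (batch,), scalar reward-model outputs
    r_theta(x, y_w) and r_theta(x, y_l) for the preferred and dispreferred
    completions of the same prompts.
    """
    # -log sigmoid(r_w - r_l), averaged over the batch (the K = 2 case above).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```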
Similarly to the reward model, the human feedback policy is also fine-tuned over the pre-trained model. The objective of this fine-tuning step is to adapt the pre-existing, unaligned model (initially trained in a supervised manner) to better align with human preferences by adjusting its parameters based on the rewards derived from human feedback. The output of the reward model can be used as the reward to be maximized using RL for the prompt-response pairs.[14] The environment randomly presents the policy with prompts from the dataset and expects responses to them, simulating real-world scenarios where the agent must understand diverse prompts and generate appropriate responses. Denoting the learned RL policy with parameters $\phi$ as $\pi_\phi^{\mathrm{RL}}$, we can define the following objective function:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[ r_\theta(x, y) - \beta \log\left(\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right) \right]$$

where $D_{\pi_\phi^{\mathrm{RL}}}$ is the training distribution we are drawing from and $\pi^{\mathrm{SFT}}$ is the previously trained, unaligned model. The constant $\beta$ is used to adjust the intensity of the KL penalty term. This penalty is applied on a per-token basis between the policy and the unaligned models' outputs. Its purpose is to avoid excessively fine-tuning the policy, ensuring that the training process does not overly specialize the model on the new training data.[15][14] This KL term works by penalizing the KL divergence (a measure of statistical distance between distributions) between the model being fine-tuned and the initial supervised model. By choosing an appropriate $\beta$, the training can balance learning from new data while retaining useful information from the initial model, increasing generalization by avoiding fitting too closely to the new data. Aside from preventing the new model from producing outputs too dissimilar to those of the initial model, a second motivation of including the KL term is to allow the policy to further explore the environment by encouraging additional entropy, which can prevent the model from collapsing to a single mode.[14]
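A minimal sketch of the per-response quantity being maximized, assuming per-token log-probabilities are available from both the policy and the frozen supervised (SFT) model, could look as follows; the variable names and the value of beta are hypothetical.

```python
import torch

def kl_penalized_reward(reward_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        reference_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Combine the reward-model score with a per-token KL penalty.

    reward_score: shape (batch,), r_theta(x, y) for each sampled response.
    policy_logprobs, reference_logprobs: shape (batch, seq_len), log-probabilities
    of the sampled tokens under the policy and under the frozen SFT model.
    beta: strength of the KL penalty (0.1 is an arbitrary example value).
    """
    # Per-token penalty beta * (log pi_RL - log pi_SFT), summed over the response.
    kl_penalty = beta * (policy_logprobs - reference_logprobs).sum(dim=-1)
    # Quantity that the RL step (e.g., PPO) tries to maximize per response.
    return reward_score - kl_penalty
```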
In simpler terms, the objective function calculates how well the policy's responses are expected to align with human feedback. The policy generates responses to prompts, and each response is evaluated both on how well it matches human preferences (as measured by the reward model) and how similar it is to responses the model would naturally generate. The goal is to balance improving alignment with human preferences while ensuring the model's responses remain diverse and not too far removed from what it has learned during its initial training. This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses.
A second term is commonly added to the objective function that allows the policy to incorporate the pre-training gradients. This term keeps the model from losing its initial language understanding ability while it learns new tasks based on human feedback by incorporating its original pre-training task of text completion. The final objective function is written as:
$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[ r_\theta(x, y) - \beta \log\left(\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right) \right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}\left[ \log\left(\pi_\phi^{\mathrm{RL}}(x)\right) \right]$$

where $\gamma$ controls the strength of this additional term and $D_{\text{pretrain}}$ is the original pre-training text distribution.[15] This objective function can then be directly used to train the policy using the proximal policy optimization algorithm.[15][14]
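Continuing the earlier sketch, the pre-training term can be folded in as one extra component. Again, this is only an illustration of how the pieces combine, with hypothetical names and weights, not the implementation from the cited papers.

```python
import torch

def total_objective(kl_penalized_rewards: torch.Tensor,
                    pretrain_logprobs: torch.Tensor,
                    gamma: float = 1.0) -> torch.Tensor:
    """Add the pre-training language-modeling term to the KL-penalized reward.

    kl_penalized_rewards: shape (batch,), per-response values as sketched above.
    pretrain_logprobs: shape (batch, seq_len), log-probabilities the policy assigns
    to token sequences drawn from the original pre-training distribution.
    gamma: weight of the pre-training term (1.0 is a placeholder value).
    """
    pretrain_term = gamma * pretrain_logprobs.sum(dim=-1)
    # A scalar to be maximized; an optimizer would typically minimize its negative.
    return (kl_penalized_rewards + pretrain_term).mean()
```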
In total, this objective function defines the method for adjusting the RL policy, blending the aim of aligning with human feedback and maintaining the model's original language understanding.
Limitations
RLHF suffers from challenges with collecting human feedback, learning a reward model, and optimizing the policy.[38] Compared to data collection for techniques like unsupervised or self-supervised learning, collecting data for RLHF is less scalable and more expensive. Its quality and consistency may vary depending on the task, interface, and the preferences and biases of individual humans.[15][39]
The effectiveness of RLHF depends on the quality of human feedback. For instance, the model may become biased, favoring certain groups over others, if the feedback lacks impartiality, is inconsistent, or is incorrect.[3][40] There is a risk of overfitting, where the model memorizes specific feedback examples instead of learning to generalize. For instance, feedback predominantly from a specific demographic might lead the model to learn peculiarities or noise, along with the intended alignment. Excessive alignment to the specific feedback it received (that is, to the bias therein) can lead to the model performing sub-optimally in new contexts or when used by different groups.[41] A single reward function cannot always represent the opinions of diverse groups of people. Even with a representative sample, conflicting views and preferences may result in the reward model favoring the majority's opinion, potentially disadvantaging underrepresented groups.[38]
In some cases, as is possible in regular reinforcement learning, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards rather than genuinely improving its performance.[42] In the case of RLHF, a model may learn to exploit the fact that it is rewarded for what is evaluated positively and not necessarily for what is actually good, which can lead to it learning to persuade and manipulate. For example, models might learn that apparent confidence, even if inaccurate, garners higher rewards. Such behavior, if unchecked, is not just incentivized but can cause significant deployment issues due to the model's potential to mislead. Studies have found that humans are not skilled at identifying mistakes in LLM outputs in complex tasks; therefore, models learning to generate confident-sounding yet incorrect text can lead to significant issues when deployed.[38]
Alternatives
Reinforcement learning from AI feedback
Similarly to RLHF, reinforcement learning from AI feedback (RLAIF) relies on training a preference model, except that the feedback is automatically generated.[43] This is notably used in Anthropic's constitutional AI, where the AI feedback is based on the conformance to the principles of a constitution.[44]
Direct preference optimization
Another alternative to RLHF called Direct Preference Optimization (DPO) has been proposed to learn human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. Unlike RLHF, however, which first trains a separate intermediate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process by directly adjusting the main model according to people's preferences. It uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model, helping it understand and prioritize human preferences without needing a separate step. Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.
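The DPO loss can be written directly in terms of the log-probabilities of the preferred and dispreferred responses under the policy being trained and under a frozen reference model. The sketch below follows the commonly used formulation; the tensor names and the value of beta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each tensor has shape (batch,) and holds the summed log-probability of the
    preferred ("chosen") or dispreferred ("rejected") response under either the
    policy being trained or the frozen reference model.
    beta: temperature controlling the implicit KL constraint (0.1 is an example).
    """
    # Implicit reward margins obtained from DPO's change of variables.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```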
DPO is simpler to implement and train than RLHF and has been shown to produce comparable and sometimes superior results.[45] Nevertheless, RLHF has also been shown to beat DPO on some datasets, for example, on benchmarks that attempt to measure truthfulness. Therefore, the choice of method may vary depending on the features of the human preference data and the nature of the task.[46]
Russell, Stuart J.; Norvig, Peter (2016). Artificial Intelligence: A Modern Approach (Third, Global ed.). Boston: Pearson. pp. 830–831. ISBN 978-0-13-604259-4.
Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
Schoenauer, Marc; Akrour, Riad; Sebag, Michele; Souplet, Jean-Christophe (18 June 2014). "Programming by Feedback". Proceedings of the 31st International Conference on Machine Learning. PMLR: 1503–1511. Retrieved 26 February 2024.
Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.10163. doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
MacGlashan, James; Ho, Mark K.; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul F. (2020). "Learning to summarize from human feedback". Advances in Neural Information Processing Systems. 33.
Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). "Training language models to follow instructions with human feedback". Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. arXiv:2203.02155.
Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Martins, Pedro Henrique; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv:2305.00955 [cs.CL].
Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290 [cs.LG].