Generating music from foundations

Author: Kevin Yin

The above button generates music by modeling how humans hear and process notes. Musical behaviors like tonality, melody, and structure arise emergently. No AI is involved. Compare the generator's quality to the state of the art in AI music.

The idea is that musical behaviors are logical consequences of how humans perceive music. For example, roughness and consonance create tonality, expectations create scales, and novelty and order create motifs. Since perception and processing are the source of musical behaviors, appropriate parts of music theory should arise when their foundations are created. This article describes novel properties of perception and shows how they create familiar properties of music composition.

The goal is to implement everything in code to remove any fuzzy or subjective parts. This is worthwhile because the theory specifies how humans decide whether music is good. With this evaluation function, running an optimization algorithm is equivalent to how humans compose music: trying revisions and checking the quality of the result.

The competing approach is to model musical behaviors directly, rather than trying to derive them from perception. In practice, modeling perception uses fewer rules to express the same behaviors, and its mechanistic explanations help prevent corner cases. For example, a large and incomplete list of chord progressions is replaced by a calculation of harmonic consonance. The major downside is that modeling perception is difficult: it's not enough to find behaviors in music; one must also find their causes.

An advantage of working with perception is that its theories are scientific rather than artistic. It makes falsifiable predictions, which enable simple and replicable experiments. The learnings from these experiments are helpful in general, not tailored to a single generator. Fixing one issue won't knock over other hidden causal variables, since the foundations are modeled, not surface correlations. So there are no worries about treading water.

The generator is missing some theories, so it doesn't exhibit their corresponding behaviors. Pitch and harmony are 80% implemented. Memory is 1/2 implemented. The evaluation function has 3/4 theory done, but is 20% implemented. Connections have complete theory and are 70% implemented, but are not turned on. Rhythm, tension, and timbre have minimal progress. Missing areas are either generated randomly (like rhythm), or stubbed with poor approximations.

Here are the core theories:

  1. Roughness and consonance are the source of harmony. Using these calculations frees us from the twelve-tone scale.
  2. A note "makes sense" when the listener hears a set of logical rules that he values. For example, "sounds consonant" is a rule. When phrases are retrieved from memory, past rules change which rules are valued in the present, which creates variation.
  3. Auditory memory specifies which phrases can be retrieved, hence controls short-term and long-term structure. Connections and expectations describe whether phrases are perceived as similar or unrelated. Similar phrases are retrieved from memory.
  4. Humans judge music as good if it simultaneously "makes sense" and is new to them. It's like a demonstration of unexpected structure. This forms the evaluation function, and all factors are calculable.

Other important theories are tension, meter, timbre, pitch memory, and auditory scene analysis.

I strove for conciseness. Sections: Humans and computers; Memory and structure; Language of music; What makes music good; Tension; Rhythm; Harmony (roughness, lattice tones, prime ratios); Streams; Future work.

Humans and computers

Human composers grow their abilities by two methods.

The first method is to form intuitive estimates of the hidden rules of music, such as in artificial grammar learning. Listening and experimentation create improvement through experience, but this improvement is not transferable to others. Often, this inexpressibility creates a sense of mystery. (For example, experts on swing could perform it but gave nonsense answers when asked to define it. Their assertions that swing is magical and ineffable contradict our understanding of swing today.)

The second method is to guess the hidden rules explicitly, then write these guesses down concretely and clearly. This is much harder, but has an important advantage over intuition: these rules can be conveyed, so they can be improved and dissected by many people, as part of a scientific process.

Musicians use both methods, in theory and practice: learning music theory, and practicing to build intuition.

Computer music generation nowadays focuses on machine learning. Riffusion was a quantum leap in quality and method. So music generation has shifted toward the first method, intuition.

Meanwhile, auditory scientists follow the second method. They experiment on human audio perception and write down their theories and results. They are generally uninvolved with generating music. There's not much mixing between the auditory scientists, who analyze humans, and the computer music researchers, who analyze compositions.

Our generator follows the auditory scientists. Its rules for composition arise indirectly from human audio perception. This is in contrast to other computer music researchers, who derive such rules by hand or by AI, by looking at prior compositions.

Memory and structure

Auditory memory is not random-access; retrieval can only start from specific notes. You can try these experiments now to see quirks of your long-term auditory memory:

  1. Memory moves forward, not backward. Consider the lyrics of a song you are familiar with. You are able to recite them. But it's hard to recite the lines in reverse order: last line, then second-to-last line, etc. Similarly, it's hard to recite the words in reverse order. If you try, you will find that you are picking random verses, reciting forward until you reach the previous stopping point, then reversing. This transfers those lines from long-term memory into short-term memory.
  2. You can skip to the next line, and you can recall the beginning of the current line. You can recite the first word of each line. At any word, you can recall the first word of the next line and the first word of the current line.
  3. You can't skip to other positions. To recite every other word, you must retrieve every word and drop the words you don't need; you can't skip to the next next word without retrieving the word in the middle. The same applies if you try to retrieve the first word of every other line: you must retrieve the first word of every line and drop half the words. You also can't retrieve the second word from the next line, unless you retrieve the first word first.
  4. Random positions are hard to retrieve. If you try to list random lyrics, you can only recall a few. In addition, each lyric you find will start at the beginning of a line.
  5. Retrieval can be prompted. If you are given a line from a song, you can name the song and retrieve the next line. This works even if the given words start in the middle of a line, but it's slower. So even though you can't recall arbitrary memory locations on your own, they are still readily accessible when given externally.

Memory for pitch and speech are separate, but they share these properties. Metric stresses and phrase beginnings correspond to line beginnings, and notes correspond to words. (You can try the same tests on an instrumental song you're familiar with.)
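
To make these access patterns concrete, here is a minimal sketch (in Python, not the generator's actual code) of long-term auditory memory as a forward-only structure indexed by line starts. The class and method names are mine, chosen only for illustration.

```python
class PhraseMemory:
    def __init__(self, lines):
        # Each line is a list of words (or notes); only line starts are indexed.
        self.lines = lines

    def recall_forward(self, line_index, start_word=0):
        """Yield elements forward from a line start (or from a mid-line word, if
        prompted externally). There is no corresponding backward iterator."""
        for line in self.lines[line_index:]:
            for word in line[start_word:]:
                yield word
            start_word = 0  # later lines always start from their beginning

    def skip_to_next_line(self, line_index):
        # Jumping to the next line start is allowed; jumping to "every other word"
        # is not -- you would have to walk forward and discard.
        return self.recall_forward(line_index + 1)


memory = PhraseMemory([["twinkle", "twinkle", "little", "star"],
                       ["how", "I", "wonder", "what", "you", "are"]])
print(list(memory.recall_forward(0)))        # forward recitation works
print(list(memory.skip_to_next_line(0)))     # so does skipping to the next line start
```

Reverse recitation would require materializing a forward pass first, which matches experiment 1 above.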

All these rules are only for long-term auditory memory. To analyze short-term auditory memory, you can listen to an instrumental-only song (such as this one). Listening will reveal that:

  1. You can recall all the most recent notes. With difficulty, you can replay those notes backwards, which wasn't possible with long-term memory.
  2. You can retrieve phrases from a short while ago, but only starting from the beginning of the phrase or a stress, not from the middle.

Ok, you can turn the song off now.

Music structure is caused by relationships between phrases. Spotting a relationship means finding a phrase in the past that is similar in some way to a phrase now, then determining the difference. To find this relationship:

  1. The old phrase must be in memory. This means the list of memory limitations also controls which phrases can be retrieved. Short term and long term memory have different limits, which influence short term and long term structure.
  2. Out of a pool of many similar and dissimilar phrases, you must be able to find which phrase is most similar, without using attention! Your attention is only able to focus on one phrase at a time, so it can't check all the phrases one-by-one for similarity. And we already know that long-term memory is unable to retrieve arbitrary positions.

Your mind has a special mechanism to retrieve a fuzzy match from memory, without using attention. It tests phrases in parallel, so it can search through a large database (more precisely, modern Feature Integration Theory says the speed is logarithmic). You may not even detect the difference between the found phrase and the template phrase. Since this process uses no conscious thought, it must be simple. We'll discuss the distance function of this fuzzy match in the next section.
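
As a rough illustration of attention-free retrieval, the sketch below (my own toy example, not the generator's code) compares a template phrase against every stored phrase at once with vectorized arithmetic, rather than checking candidates one by one. The features and weights are placeholders.

```python
import numpy as np

def best_match(template, candidates, weights):
    """template: (notes, features) array; candidates: (phrases, notes, features) array.
    All stored phrases are compared in parallel, not serially by attention."""
    diffs = candidates - template[None, :, :]
    dists = np.sqrt(((diffs * weights) ** 2).sum(axis=(1, 2)))
    return int(np.argmin(dists)), dists

template = np.array([[60, 0.5], [62, 0.5], [64, 1.0]])   # (pitch, duration) per note
stored = np.stack([template + np.random.randn(3, 2) * s for s in (0.2, 2.0, 5.0)])
index, distances = best_match(template, stored, weights=np.array([1.0, 0.3]))
print(index, distances)   # the least-perturbed phrase wins
```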

Memory decays at different rates for different features. For example, timbre and loudness are recorded in short term memory, but are poorly copied into long term memory. Pitch and rhythm are better remembered.

At the bottom of each section is a button listing the relevant prior research and inspirations, like this one:

There's some research on auditory memory (Zimmermann et al. 2016; Schulze et al. 2018), but it doesn't align with topics useful for computer music generation. Researchers found the separation of auditory memory into sensory memory (notes heard now), short-term memory, and long-term memory. Baddeley characterized the phonological loop. Deutsch and others separated memories for pitch and non-tonal speech, and performed experiments on pitch memory, although these experiments are not used here.

Language of music

Given two phrases, you can compare them and determine if they are related or not. This means there is a distance function between phrases.

A note is formed of features such as pitch, brightness, and loudness. Each feature is conceptually "one property only". Behavior is similar to Feature Integration Theory (Wolfe 2020). Features are perceived independently, then combined; illusory conjunctions support this explanation.

A connection is a relationship between notes held simultaneously in short term memory. (Notes from long term memory can be retrieved into short term memory, and hence become available for connections.) Connections only matter if your mind recognizes them. Here are some example connections:

A connection is decorated with two properties:

Memory recall pairs up notes in the past to notes in the present. This recall is represented by a ladder of association, where each rung pairs two notes. The left of the ladder represents notes in the past, the right of the ladder represents notes in the present, and the ladder says "these notes are associated to each other". By the previous section, memory is recalled sequentially. It is the same for ladders, which grow one rung at a time as notes are sequentially recalled and associated to the present. Each ladder corresponds to one timepoint being recalled, travelling forward in time, and multiple ladders can be active simultaneously. Notes can be omitted from a ladder, but each side can only have one copy of each note.

The effect of a ladder is that connections on the left side create expectations for similar connections to appear on the right side. The strength of the expectation is increased by successful expectations of the same type. For example, expectations in pitch do not affect expectations in timbre, whether they are successful or not.

The strength of the ladder is how willing the mind is to continue retrieving notes from memory. It is equal to the sum of the strengths of its expectations. When this strength falls to 0, the retrieval stops and the ladder terminates. Ladders have an initial setup cost equal to the difficulty of retrieval from memory. For example, if two phrases are 50 notes apart, it is much harder to retrieve the first phrase than if they are 4 notes apart. The ladder will not be recognized until enough expectations have been fulfilled.
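
A minimal data-structure sketch of a ladder, under the assumptions in this section (rungs pair past and present notes, fulfilled expectations add strength, and a setup cost grows with retrieval distance). The class, fields, and constants are illustrative, not taken from the generator.

```python
class Ladder:
    """One recalled timepoint travelling forward in time, paired note-by-note
    with the present. Multiple ladders can be active simultaneously."""

    def __init__(self, memory_distance_in_notes):
        self.rungs = []                                   # (past_note, present_note) pairs
        self.strength = 0.0                               # sum of expectation strengths
        self.setup_cost = 0.1 * memory_distance_in_notes  # retrieving far back is harder

    def add_rung(self, past_note, present_note, expectation_strengths):
        """expectation_strengths: per-feature contributions (e.g. {'pitch': 1.0}),
        positive when a connection on the left reappears on the right, negative
        when the expectation is contradicted."""
        self.rungs.append((past_note, present_note))
        self.strength += sum(expectation_strengths.values())

    def recognized(self):
        # The ladder is not perceived until enough expectations are fulfilled.
        return self.strength > self.setup_cost

    def should_continue(self):
        # Retrieval stops once the willingness to keep retrieving falls to zero.
        return self.strength > 0.0
```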

A connection's order is determined by whether expectations support or contradict it. For example, in this phrase of 8 notes, the default expectation is that the first 4 notes are repeated by the second 4 notes. The last note is not an exact repeat, but its perceived order is still higher than if the first 4 notes did not exist at all. Hence, the expectation created by the first 4 notes boosts the order of the last note's pitch.

Some notes are easier to recall from memory: phrase beginnings, stresses, and beginnings of previous ladders. Observationally, when composers copy prior snippets from inside a work, the copies usually start at these notes. This accords with theory, which predicts that retrieval from these notes is more efficient, hence ladders are easier to build. Likewise, composers frequently terminate the copy at a phrase ending, which is where connections either cease or become ambiguous.

Connections and ladders explain many behaviors of music structure in a parsimonious way:

While connections and ladders have structure that cause the above behaviors, they do not predict why these behaviors should be preferred. All they predict is what is perceived as "ordered" and which connections are considered important. Just because a note is ordered does not mean it is good. The final piece of this puzzle is in the next section.

Note that connections are evaluated within one side of a ladder, not across the ladder, which may transform their perception in nonobvious ways. For example, if we merge phrases 10 3 9 4 8 5 7 6 and 6 7 5 8 4 9 3 10 into 10-6 3-7 9-5 4-8 8-4 5-9 7-3 6-10, the pitch contours of the components are not recognizable in the merged phrase. This is because stream segregation forces all the top pitches to connect to each other, so you cannot perceive the connections between alternating high-low notes.

The initial cost to set up a ladder is roughly the feature distance between the first two notes on each side, plus a memory cost. It's possible that features without labels, like timbre, are poorly encoded in long term memory. Rhythm is a special feature; it forms an index by which the note is retrieved, so it is bound to all other properties of the note, and is not an independent feature like pitch or loudness. In a later section, rhythm is described; a note is indexed by its position within nested sets of metric units, and time offset from that metrical position. The note's "handle", in computing terms, is not its time, but its rhythmic position, as an iterator into a skip-list-like structure. The strongest connections for phrase retrieval are rhythm, timbre, pitch, and harmony.
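
Here is a small sketch of that "handle" idea in code (the names and fields are mine, chosen only to illustrate the indexing): a note is keyed by its position inside nested metric units plus a time offset, while the other features stay independent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPosition:
    path: tuple      # index within each nested metric unit, outermost first, e.g. (bar, beat, subdivision)
    offset: float    # small time offset from that grid point (swing, rubato)

@dataclass
class Note:
    position: MetricPosition  # the retrieval key: rhythm is bound to everything else
    pitch: float              # independent features follow
    loudness: float
    timbre_id: int

# Two notes at "the same place" in consecutive bars differ only in the outermost
# index of the key, which is what makes stress-aligned forward retrieval cheap.
n1 = Note(MetricPosition((0, 2, 0), 0.0), pitch=440.0, loudness=0.6, timbre_id=3)
n2 = Note(MetricPosition((1, 2, 0), 0.0), pitch=440.0, loudness=0.6, timbre_id=3)
print(n1.position.path[1:] == n2.position.path[1:])   # True
```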

Applying feature theories (Shinn-Cunningham 2008) to audio (Spence and Frings 2020) is relatively unexplored. There are actually two separate distances. The distance used for retrieval from memory is different from the distance after attention. These are the "search template" and "target template". I have not explored the differences between these two in audio. It's also unclear how features cleave, if they are not built-in. For example, proximity and envelopment were clarified by research on reverb by Tapio Lokki's lab (Kaplanis et al. 2019) and David Griesinger. But most people do not read these papers, so they do not separate proximity from the "timbre" catch-all. Does proximity really behave as a feature? Does isolating it change its distance function, and what is its distance function pre-isolation?

What makes music good

The goal of music is to learn, forming a mental model of sound. The language of connections is a set of axioms, and music is a demonstration of the emergent structure. This is a specific type of learning, caused only by improvement of a mental model for music, like the training of a neural network. Other types of learning, such as memorizing facts, do not evoke the same emotion. Music is good when it maximizes this metric: training = [distance from what you know] * [order]. ("Training" is a poor choice of terminology, but I didn't find a better word.)

  1. Distance from what you know = how poorly it is encoded by your existing mental model. This represents what you don't already understand. For example, you have (likely) already memorized Twinkle Twinkle Little Star, so it's fully understood and encoded, and its distance is 0. Simple transforms won't affect it: if this song is heard one octave higher, you still understand it in relation to the original song, so distance is still 0. But if the song is reharmonized, there is no obvious transform to construct it, so the distance is positive. A remix may not be very surprising, so it has low distance, whereas a counterexample to a foundational belief will have high distance.
  2. Order = if it "makes sense". For connections, order = high quantity and order of unique connections. It matters that the listener personally perceives these connections and considers them highly ordered.

Training is high when distance and order are simultaneously high. This is for an idea that makes sense, but that you did not believe could exist, or that didn't fit what you already knew. Another way to see it is as an interesting surprise. (This is also how people judge math theorems.)

Since learning is the goal, it is natural that the metric should reward updating the mental model. However, difficulty in encoding the surprise in the model does not reduce perception of training (in fact, it may actually increase perception of training, as it makes the idea seem mysterious instead of simple). An example is a stage magician who shows off a cool trick that the audience cannot figure out. The audience's model update is only that it is possible, not how it happens, but the trick is still cool. The distance to the model is sufficient; actually learning the mechanics is optional. This means hearing good music does not guarantee the mental model is making progress.

Music preferences changing over time is a sign of personal growth. As your model improves, you comprehend old songs (thus making old songs boring) and perceive more connections (thus opening up new songs). One example is harmony; people initially recognize no intervals, even octaves (Jacoby et al. 2019). As they listen to music, they start to hear common intervals as harmonic, so music relying on these intervals becomes accessible and interesting. Jazz listeners assign order to even more intervals, which lets them enjoy music with those intervals. (Theory does not say whether assigning order to those intervals is good or bad.)

Many works have structure that listeners cannot hear. For example, Bach's Goldberg Canons follow simple rules of retrograde, inversion, and time shifting. But these rules are not discovered during listening unless the listener is told to look for them. Thus, these rules do not contribute to order. A common sign of incomplete comprehension is repetition, either within a piece, or by re-listening to the same piece. Fuzzy matching and memory are limited, which forms obstacles to spotting connections.

For instrumental music, there are four major sources of training.

The first source is timbre. Perception of quality in the first 10 seconds of a song is driven mostly by timbre. Listeners learn both individual timbres and the paths between timbres. For example, a lowpass filter applied to a violin may not be interesting. But if the filter's cutoff frequency changes over time, it shows a smooth path between different timbres that the listener may not have known. Examples:

The second source is limits and ambiguities of mental processing. This is when the language of music fails to characterize music properly, but expectations are still met. These deviations from structure create moments of discovery. Examples:

The third source is ties to external traits. This category is out of scope for this article, since it involves complicated problems like language modeling. Examples:

The last source is connections. Examples of training from connections:

The naive way to generate music is to create connections of varying complexity. High complexity connections are likely to be new, low complexity connections are ordered, and sometimes these connections link together to be interesting. For now, this is what the generator does, since implementation of training is unfinished.

However, connections are not independent. Pitch affects stream segregation, and pitch contour and analog pitch interact. Timing and rhythm interact with each other, and also affect information delivery and lyrics. Timbre affects chordalness, pitch, and brightness. Most features affect tension. Harmony is indirectly created, as it is derived from frequency. As a consequence of these interactions, connections must be measured after creation, rather than predicted in advance.

The better way to generate music is as an optimization problem with training as the metric. Some simplifying assumptions are still necessary. The listener's model only contains material he has previously heard from the same piece (or past pieces the algorithm knows about); these share ladders with the expectations. The distance function specifies the obvious transformations as modifications of independent features. Comprehension is influenced by memory limits, complexity of the connections, attention, and time (slower notes are easier to process). Order is calculated as in the previous section. Order and distance must happen together in a single feature. For example, if the pitches are random and hence novel, and the rhythm is highly ordered, there is still no training, because the novelty and order are in different features. Tying training to individual features is valid to the extent that features are independently modifiable, and that the listener understands this independence. This handles training from connections.
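
The sketch below shows the shape of that optimization loop with training as the objective. The scoring functions are crude stand-ins I wrote for illustration; the real distance, order, and per-feature bookkeeping described above are far more involved.

```python
import random

def distance_from_model(piece, known_phrases):
    """Stub: fraction of the piece's notes not already in the listener's model."""
    known = set(n for phrase in known_phrases for n in phrase)
    return sum(n not in known for n in piece) / max(len(piece), 1)

def order(piece):
    """Stub: fraction of adjacent pitch steps that are small (a crude 'makes sense')."""
    steps = [abs(b - a) for a, b in zip(piece, piece[1:])]
    return sum(s <= 2 for s in steps) / max(len(steps), 1)

def training(piece, known_phrases):
    # Real scoring must tie distance and order to the same features; this global
    # product only shows the shape of the objective.
    return distance_from_model(piece, known_phrases) * order(piece)

def optimize(piece, known_phrases, steps=5000):
    best, best_score = piece, training(piece, known_phrases)
    for _ in range(steps):
        candidate = best[:]
        candidate[random.randrange(len(candidate))] += random.choice([-2, -1, 1, 2])
        score = training(candidate, known_phrases)
        if score > best_score:                      # try revisions, keep improvements
            best, best_score = candidate, score
    return best

print(optimize([60, 60, 60, 60, 60, 60, 60, 60], known_phrases=[[60, 62, 64]]))
```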

The other two sources of training, timbre and ambiguities, appear to be a collection of special cases. They obey the same principle and calculations, but the individual instances are too heterogeneous to have an elegant theory.

Learning in music generalizes to other fields in the concept of "fun", consisting of self-improvement of (mental) capabilities. (This definition requires much more specification, with 400 pages of unpublished research laying out the technicalities and boundaries. But these details have little relevance to music.) Fun is the core of fiction novels, games, art, movies, humor, and music. For example, in game design, players have fun when they perceive:

  1. their effort leads to improvement in ability
  2. that improved ability matters, leading to an improved outcome either inside or outside the game

Music is fun, so the mind believes that listening to music will increase some ability. Musicians consistently do better in some domains (Agres and Krumhansl 2008), mainly tonality and tonal memory, but real-world benefits are inconclusive (Neves et al. 2022).

In modern classical music, there is a tendency for composers to write connections that listeners cannot hear, but that people reading the score can see. As usual, connections only matter if you can perceive them - so the listeners won't enjoy the work, but the readers might. For example, Bach's Goldberg Canons are like a clever math puzzle. The rigidity in their construction can make them interesting. Maybe you care about such things, or maybe not, but math puzzles are also out of scope for this article. For our definition of training to be valid here, we would need to replace order with value, and then you would specify whether you value the order in these puzzles.
The Rite of Spring is another example of people perceiving order differently. Its rhythms and harmony are very high distance, since they're in extreme contrast to prior works. If you consider freeform rhythm and high dissonance to "make sense", then it's a great work, and otherwise it's a meaningless work. This is why reactions are polarized into acclaim and disapproval. As with jazz harmonies, both perspectives are justified.

From Boden (1998), "A creative idea is one which is novel, surprising, and valuable (interesting, useful, beautiful...)." That is a good definition. In music, distance measures what is "surprising", and order measures what is "valuable". I did not find any divergence between value and order in music. Since order is easier to understand, I used order. But "value" is the better concept when generalizing to other fields.
Boden's core idea is also valid for fun, but requires filtering through psychology. For example, leveling up and finding equipment in an RPG are "fun", although they are neither "surprising" nor "valuable". And listening to a lecture about paradoxes of infinity may not be fun, even if it is on your next exam and hence both surprising and valuable. But these violations are only definitional, and the core idea still works.
Benign Violation Theory (2010) is a similar theory in humor, whose word "violation" is insightful. BVT would be better if it considered value. The result of a 20-sided dice roll is both benign and surprising (5% chance to predict correctly), but it has zero value and is hence not humorous. Some things also cause awe rather than humor.

Tension

Tension is the expectation of imminent future training, where training is as defined in the previous section. The outcome of tension is a focusing of attention and processing.

If there is no expectation of training, there is no tension. Random notes are unordered and cause dissonance, but no tension. The hum of a power line is ordered and causes boredom, but no tension.

Tension is only an expectation of training, not a guarantee. If the training appears as expected, it causes a resolution. If the training fails to appear, the tension will continue or disappear unsatisfactorily.

Observationally, music sections often end in a resolution, rather than tension. Theory predicts that this is efficient: if ideas were appended to the end without the corresponding connections, those ideas could be removed without decreasing training (this ignores mood-setting music, which uses tension for a different purpose). Also, since the resolution has strong connections to prior notes, a separation between the resolution and its connected notes would reduce its power. This feature of tension and resolution is a result of process, not a fundamental property.

Here are some sources of tension:

This link has audio examples of tension.

There are interesting ways to violate tension, using the third source of training, limits and ambiguities of processing. For example, a constantly increasing speed appears unsustainable, but a Risset rhythm can continue forever.

Tension theory is still imprecise. Separating cause and effect is necessary for these co-occurring behaviors: expectation of imminent training, focusing of processing/training, and expectation of change.

Composers have been successful at listing surface features of tension, such as dissonance and repetition, and at constructing and recognizing examples of tension. But providing an accurate definition has been elusive. Some researchers, such as Huron, have realized that tension is "expectation" or "anticipation" of some kind, but their further specifications are not illuminating.

Rhythm

Rhythm is decided by the beginnings of notes, and note endings are unimportant. These note beginnings are placed into structures we will call "metric units". There are three main operations listeners can hear:

Division automatically causes merging. If the division is not a mergeable number, then no metric unit is created.

For example, Take 5, with a 5/4 meter, merges 3 notes and 2 notes, then merges again to form 5. There is no merging to 5 directly.
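
A tiny sketch of that nesting, assuming (as the Take 5 example suggests) that only counts of 2 and 3 merge directly; the representation and the MERGEABLE set are my own illustration, not a claim from the theory.

```python
MERGEABLE = {2, 3}   # assumption drawn from the Take 5 example: no direct merge to 5

def metric_unit(counts):
    """counts: e.g. [3, 2] -> a unit of 5 built from a 3-unit and a 2-unit."""
    if not all(c in MERGEABLE for c in counts) or len(counts) not in MERGEABLE:
        raise ValueError("sub-units and their count must both be mergeable")
    return {"beats": sum(counts), "children": [{"beats": c} for c in counts]}

take5_bar = metric_unit([3, 2])   # 5/4 heard as 3+2, never as a direct 5
print(take5_bar)
```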

In a metric unit, the first timepoint has more perceptual importance than the others, and is distinguished in memory. It may not have a stress; stresses are simply tools to demonstrate meter, just like bass and snare drums, or phrase repetitions.

It is unspecified which metric units can be repeated or merged, and what the consequence is; this is a very large absence in the theory. There are also swing notes, which happen before or after a metric timepoint while being associated to it. If the note is before a metric stress, and no other note takes that stress, then the first note acquires the stress.

As a consequence of rhythmic organization, for a retrograde phrase to be recognized, note stresses should be preserved. The typical retrograde transformation of flipping note durations back-to-front does not create a connection that listeners can hear.

This section has no new discoveries; it's all from the literature, and I have done little testing. Yust's book is likely a better source, although I have not yet read it.

Harmony: roughness

Computer and human composers nowadays rely on the twelve-tone scale. Harmonic theory frees us from this limitation.

Helmholtz reported that dissonance is caused by close-by frequencies beating unpleasantly, called "roughness". Frequencies beat more unpleasantly when farther apart, but interact more strongly when close together. These opposing effects multiply, so roughness is greatest when frequencies are separated by a short distance. Also, roughness is a property of individual frequencies, and roughness of the holistic sound is only a consequence.

The roughness of a single frequency A is a weighted average of its roughnesses with other frequencies B. Let B be a sine wave that is active when A is played (B can be A itself). Let E_B be the loudness of B.
scale = 19 + 0.021*A (the scale term is suggestive of critical bandwidth, but the coefficients are very different)
x = (f_high - f_low) / scale
interaction = e^(-0.84x)
divergence = 1 - e^(-x)
roughness(A) = (∑_B E_B * interaction * divergence) / (∑_B E_B * interaction)

The roughness of a chord is the loudness-weighted average of the roughness of its frequencies. This contrasts with previous theories, which summed roughness rather than averaged it. Sums fit very poorly as the number of frequencies increases. I have not specified the units of loudness because I do not know the correct nonlinear transformation. My current approximation is loudness = energy adjusted by the equal-loudness contour. Perceived roughness scales faster than amplitude^2, so this approximation is known to be deficient. There are possible fixes (one fix: in the next section, nonlinearities can be inserted for lattice tones; but even perceived beating of true AM tones is nonlinear). However, since the flawed formula provides a simplification (amplitude-invariance of dissonance), and its failure mode is safe, I have not explored improvements for now.
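
A direct transcription of the formulas above into code, using loudness = energy with the caveats just mentioned; the function names and example frequencies are mine.

```python
import math

def roughness_of_frequency(fA, partials):
    """partials: list of (frequency, loudness) sine components active while A sounds,
    including A itself."""
    num = den = 0.0
    for fB, loudB in partials:
        scale = 19 + 0.021 * fA
        x = abs(fA - fB) / scale
        interaction = math.exp(-0.84 * x)
        divergence = 1 - math.exp(-x)
        num += loudB * interaction * divergence
        den += loudB * interaction
    return num / den if den else 0.0

def roughness_of_chord(partials):
    # Loudness-weighted average over the chord's frequencies (not a sum).
    total = sum(loud for _, loud in partials)
    return sum(loud * roughness_of_frequency(f, partials) for f, loud in partials) / total

# A close pair beats harshly; a wide pair barely interacts.
print(roughness_of_chord([(200.0, 1.0), (215.0, 1.0)]))   # higher
print(roughness_of_chord([(200.0, 1.0), (400.0, 1.0)]))   # much lower
```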

Plomp & Levelt (1965) measured this behavior, Kameoka and Kuriyagawa (1969) refined the experiments, and Sethares (1998) fit a curve to the data. For two tones of equal loudness, Sethares's curve is more accurate than its method of construction suggests it should be. This section corrects it for multiple tones by changing the additivity and separating interaction from divergence.

Harmony: lattice tones

It is known that two frequencies near a small-integer ratio (such as 400 Hz and 803 Hz) will produce a beating effect (at 3 Hz). This is still true if the 400 Hz tone is heard only in the left ear, and 803 Hz is heard only in the right ear, as a generalization of the "binaural beats" effect.

This is caused by lattice tones. Let f1, f2, f3, ... be a set of frequencies heard simultaneously. At any integer combination of these frequencies, such as f1 + f2 - 3f3, a lattice tone will be generated. A demonstration is that with 630 Hz, 430 Hz, and 530+C Hz at loud volume, beats of 2C Hz are heard. Lattice tones matter because they are a major source of roughness. Lattice tones are not audible on their own; beating only appears when there is a nearby frequency.

The strength of a lattice tone is larger when it is close to its generating frequencies and the sum of the coefficients in its integer combination is small. My approximation for the first factor is a function of (generating frequency) / (lattice tone frequency) - 1, while my approximation for the second factor is a multiplicative 0.7 per term in the sum. These approximations are not accurate.

The corrected formula for roughness(A) in the previous section is to sum over lattice tones B, rather than just frequencies B. Each frequency generates a lattice tone equal to itself, so this is a clean generalization.
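 
A sketch of how that generalization might be computed. The closeness falloff, the coefficient bound, and the neglect of loudness are all placeholders of my own; only the 0.7-per-term factor and the "sum over lattice tones" structure come from the text. The resulting (frequency, strength) list could then be fed into the roughness formula of the previous section in place of the raw partials.

```python
import itertools, math

def lattice_tones(partials, max_coeff=2):
    """partials: list of (frequency, loudness). Returns (frequency, strength) pairs,
    with each original frequency included as its own lattice tone."""
    freqs = [f for f, _ in partials]
    tones = list(partials)
    for coeffs in itertools.product(range(-max_coeff, max_coeff + 1), repeat=len(freqs)):
        n_terms = sum(abs(c) for c in coeffs)
        if n_terms < 2:
            continue                                    # originals are already included
        f = sum(c * fr for c, fr in zip(coeffs, freqs))
        if f <= 0:
            continue
        nearest_gen = min(freqs, key=lambda g: abs(g / f - 1))
        closeness = math.exp(-3 * abs(nearest_gen / f - 1))  # placeholder falloff
        tones.append((f, closeness * 0.7 ** n_terms))        # 0.7 per term in the combination
    return tones

# 400+600-500 = 500 Hz lands exactly on a partial of the major triad (4:5:6),
# while 400+600-480 = 520 Hz lands off the 480 Hz partial of the minor triad (10:12:15).
major = lattice_tones([(400.0, 1.0), (500.0, 1.0), (600.0, 1.0)])
minor = lattice_tones([(400.0, 1.0), (480.0, 1.0), (600.0, 1.0)])
```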

Lattice tones provide the solution to a long-open problem. Major and minor triads are 4:5:6 and 10:12:15, which have the same pairwise ratios, so pairwise roughness calculates the same dissonance for them. However, major triads sound more consonant than minor triads, because the middle lattice tone aligns with 5 in the major triad, and misaligns with 12 in the minor triad. This causes extra roughness for the minor triad.

Experiments on lattice tones face considerable difficulties: masking by the comparison tone, interference from neighboring lattice tones, nonlinearity of beat perception, distinguishing them from combination tones, and perception being changed by consecutive trials. Threshold-of-detection experiments fail because the masking is strong and varies with volume (the comparison tone masks itself), and numerical estimates fail because of the lack of a consistent scale. This means only pairwise comparisons are possible, which must be designed carefully to maximize similarity in relevant features and lessen interference. Furthermore, since the listener must extract beating in a single tone from beating in other tones, an experienced listener is required.
Fitting the data is also very difficult, as only uncertain pairwise comparisons exist, and no symmetries or commutative operations are satisfied. For example, changing all the frequencies or volumes in concert is expected to produce a consistent change depending only on the changed variable. Unfortunately, the behavior of such a change depends strongly on all the other variables as well, so the function has no easy factorization.

Plomp's excellent 1967 paper Beats of mistuned consonances describes lattice tones with 2 frequencies. It fails to realize the extension to 3+ frequencies and the connection to roughness. This is surprising, since Plomp was the premier expert on roughness and an outstanding experimentalist. Since nobody spotted the paper's implications, it was regarded as merely a psychoacoustic curiosity, even by Plomp himself.
There is also another effect called "combination tones". Those are generally quieter than lattice tones, except at high frequencies. We will ignore them for this article, but they are occasionally loud enough to matter.

Harmony: prime ratios

Chords in integer ratios (like 2/1 or 3/2) are consonant. One cause is pairwise roughness of the upper partials. But the consonance exists even when all the upper partials are zeroed and roughness predicts no effect. For example, if an instrument's timbre is changed to a sine wave, its harmony is still mostly the same. This indicates that roughness is not the entire solution.

A demonstration is to break a chord up into an arpeggio. The chord and arpeggio have the same notes, yet the chord is dissonant and the arpeggio is consonant. Roughness and lattice tones only exist for simultaneous notes, so another mechanism is working for non-overlapping notes.

A note has dissonance and consonance depending on its ratio to other frequencies, independent of roughness. This consonance is tied to the note, not just a property of the overall sound.

This mechanism is a trained response. Jazz musicians have different responses from classical musicians, and non-music listeners have no response at all. The following formulas are measured on myself, who has an intermediate amount of classical music training.

Fractions with few and small prime factors sound consonant. Let n = 2^a * 3^b * 5^c * 7^d. Then dissonance(n) = 2.2b + 5c + 12d + log(n)^2. For the reduced fraction x/y, dissonance(x/y) = dissonance(x) + 0.5 dissonance(y). Inexact ratios round to the closest ratio, with increasing rounding causing increasing dissonance. Dissonance soft-caps around x/y = 7/9 or 7/5, past which ratios cease to matter and rounding takes over. Constants and formulas were found experimentally, and the fit is decent but not great. (One systematic bias: when y is divisible by 7, it measures too low for sequential notes. Also, this neglects the effect of pitch distance on harmony.)
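
A transcription of this formula into code; the search over candidate ratios and the rounding-penalty constant are my own simplifications for illustration, and the soft cap is not modeled.

```python
import math
from fractions import Fraction

def factor_dissonance(n):
    """dissonance(n) = 2.2b + 5c + 12d + log(n)^2 for n = 2^a * 3^b * 5^c * 7^d."""
    m, exps = n, {}
    for p in (2, 3, 5, 7):
        e = 0
        while m % p == 0:
            m //= p
            e += 1
        exps[p] = e
    if m != 1:
        return float("inf")          # other prime factors fall outside the formula
    return 2.2 * exps[3] + 5 * exps[5] + 12 * exps[7] + math.log(n) ** 2

def ratio_dissonance(x, y):
    """dissonance(x/y) for a reduced fraction x/y."""
    q = Fraction(x, y)
    return factor_dissonance(q.numerator) + 0.5 * factor_dissonance(q.denominator)

# Candidate ratios to round to: fractions of 7-smooth numbers up to 16.
SMOOTH = [n for n in range(1, 17) if factor_dissonance(n) < float("inf")]

def note_dissonance(f_note, f_context):
    """Round the frequency ratio to the nearest simple ratio, then add a rounding penalty."""
    r = f_note / f_context
    best = min((Fraction(a, b) for a in SMOOTH for b in SMOOTH),
               key=lambda q: abs(math.log(float(q) / r)))
    rounding_penalty = 40 * abs(math.log(float(best) / r))   # illustrative constant
    return ratio_dissonance(best.numerator, best.denominator) + rounding_penalty

print(note_dissonance(300.0, 200.0))   # 3/2: low dissonance
print(note_dissonance(289.0, 200.0))   # rounds to a nearby ratio and picks up a penalty
```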

The notes being compared to are in short-term memory and sensory memory. These are the "harmony context". In the previous paragraph, x is the frequency of the note, and y is the frequency of a note in the harmony context. The dissonance of the note is its average dissonance to its harmony context. Notes in the harmony context are weighted: ratios closer to 1 are weighted more, interference by pitched sounds causes lower weights, weighting follows loudness curves, and there is a nonlinear transformation to linearize the scale. A note belongs to its own harmony context.

The dissonance of a time is a weighted average of the frequencies at that time. Weighting is by loudness. There is an attention boost: new energy counts more than old energy. If you play an arpeggio and hold down each note, then the last note will have priority for harmony, even if the attack and sustain are equal.

Atonal music never has ratios with small primes, so it always caps out on dissonance. On the other hand, unpitched sounds do not participate in harmony, and have neither consonance nor dissonance.

The prime-ratio formula predicts chord progressions with good fit. The different constants for dissonance(x) and dissonance(y) explain the asymmetry in progressions, such as why a V-I cadence resolves while an I-V cadence creates tension.

This theory is not complete. Even though lattice tones justify why minor chords are more dissonant than major chords, it does not justify why minor arpeggios are more dissonant than major arpeggios. However, extending prime ratios from pairs to triads in the obvious way does not work, because only the minor arpeggio (and its octave transformations) behaves this way, not any other utonal arpeggio ("utonal" means the tones are divisors of some frequency, rather than multiples). I have no solution for this except "maybe it's caused by training".
I do not believe this formula is good, though it is reasonably accurate for my ears. It is simply a brute force fit of observed data with numbers 1-16, then a quadratic term to cap off the high end.
Differences in harmony perception between people are large, so experiments should record the subject's level of music experience.

Calculating the strength of each note in the harmony context is moderately hard. One test framework for this question is a sequence of notes A B C ... Z. Ideally, Z is consonant to one note, and dissonant to another note, so that the level of dissonance indicates how much the two notes contribute to the context. In my testing, two notes A and Z separated by time, with no notes in between, have small decay. If the notes are A B B ... B Z, A and Z are as connected as with only one B: A B Z. Unpitched tones between A and Z don't interfere (unpitched sounds are sounds without a clear single frequency, such as drums). All these properties are shared by existing models for pitch memory. In pitch memory tests, the listener is asked whether A and Z are the same pitch. This is much easier to measure (reference: Deutsch, The Psychology of Music, 2013, ch. 7 sec. IV). Future testing of qualitative behaviors may wish to test pitch memory, then relate it to harmony, rather than test harmony directly.

Many researchers have found that roughness is insufficient to characterize dissonance. Prior work on resolving the missing part is harmonic entropy. Its fit to data is poor, but its focus on a spreading function is good. If anyone wants to research this subject, future improved formulas should involve spreading functions and a just-so story about training from roughness of harmonic overtones.

Streams

When notes are heard as one voice, they belong to a stream. These factors promote aligning a note into a stream:

  1. Starting when the previous note in the stream ends (legato). If the previous note ends early (as in staccato) or overlaps in time, the note is less likely to be put in the stream.
  2. Similarity to the previous note in the stream in all features (pitch, timbre, loudness, location, etc). Tolerance for differences goes up with higher time separation between note starts.

You can only pay attention to one stream, which becomes the foreground stream. This foreground stream contains the melody and the primary rhythm. You can voluntarily switch your attention to other notes, so the foreground stream depends on the listener, not just the piece. Sometimes, attention is bistable, switching back and forth between different streams. After notes are segregated into streams, these factors promote turning a stream into the foreground stream (giving it attention):

  1. High pitch.
  2. High volume.
  3. Not being repetitive, such as in Ravel's Ondine or La Campanella. That includes repetitions of longer phrases, such as 4 notes being repeated.

Segregation-then-attention is not a clean order; attention also affects stream segregation. There may be no foreground stream, such as in atmospheric music.
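
A toy scoring function for these two stages (grouping by legato and feature similarity, then picking a foreground); the weights and functional forms are placeholders I chose, and as noted below, the real model's fit is still poor.

```python
def stream_affinity(note, last, weights={"pitch": 1.0, "timbre": 0.5, "loudness": 0.3}):
    """note/last: dicts with 'start', 'end', and feature keys. Higher score means the
    note is more likely to join the stream whose most recent note is `last`."""
    gap = note["start"] - last["end"]                # 0 for legato
    onset_sep = max(note["start"] - last["start"], 1e-3)
    legato = -abs(gap)                               # staccato gaps and overlaps both penalized
    similarity = -sum(w * abs(note[k] - last[k]) for k, w in weights.items())
    return legato + similarity / onset_sep           # tolerance loosens with onset separation

def foreground_score(stream):
    """Crude attention score: favor high pitch and loudness, penalize exact repetition."""
    pitches = [n["pitch"] for n in stream]
    repetition = sum(a == b for a, b in zip(pitches, pitches[1:]))
    return max(pitches) + 20 * max(n["loudness"] for n in stream) - 5 * repetition

high = [{"start": 0.0, "end": 0.5, "pitch": 72, "timbre": 0.2, "loudness": 0.6}]
low  = [{"start": 0.0, "end": 0.5, "pitch": 48, "timbre": 0.8, "loudness": 0.6}]
new  = {"start": 0.5, "end": 1.0, "pitch": 71, "timbre": 0.2, "loudness": 0.6}
best_stream = max([high, low], key=lambda s: stream_affinity(new, s[-1]))
best_stream.append(new)                              # joins the similar, legato stream
print(max([high, low], key=foreground_score) is high)
```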

The current fit of my stream segregation model to perception is pretty bad.

Streams are researched much further than described here. See Szabó et al. (2016).

Future work

This article leaves out a lot of details and experiments that are not generally interesting (as is the case for most research). Those details aren't proprietary, just boring.

What people care about is not a beautiful model of music cognition, but a working music generator. Results, not theory. Musicians don't read music cognition research because they don't see how it helps composition. Hopefully, this article shows that music cognition research does have a positive impact.

Human composers pick up knowledge faster and easier than it can be coded into programs, which can create an advantage in composition quality. This is why a "first principles" approach is unsuccessful in writing compared to both humans and AI, which are more effective at acquiring knowledge. I've been checking existing compositions to see if this will be an issue in music. I've typically found good and concise explanations for their ideas, implying that hidden factors that humans have are also feasibly replicable by a computer. This process is nonscientific, because without fully specifying the algorithm beforehand, these fits could be caused by subjective choice or ambiguity, or by me missing crucial factors. Still, I did not find obvious limits to using perception as the foundation of a generator.

In the short term, converting completed theories to code should improve the generator greatly.

In the long term, experiments are still needed to develop theory. Anyone can do these experiments. Memory can be affected by many features, like loudness and tempo, but the effects are unclear. How loudness affects harmony is unclear. How harmony decays in the presence of other notes is unclear. Whether tension permits a simple calculation is unclear. Behavior of non-foreground streams is unclear. When notes are played too quickly, their comprehension sharply drops with just a 20% speedup, but the boundaries of this limit are unclear.

I'm not currently supported, so if anyone can offer me a remote coding/research position, please do so! My resume is here. It doesn't have to be in music; I have plenty of other skills, including ML and programming.

Discussion: [hn link]
Chat: Matrix chatroom