Generating music from foundations

Author: Kevin Yin

The above button generates music by modeling how humans hear and process notes. Musical behaviors like tonality, melody, and structure arise emergently. No AI is involved. Compare the generator's quality to the state of the art in AI music.

The idea is that musical behaviors are logical consequences of how humans perceive music. For example, roughness and consonance create tonality, expectations create scales, and novelty and order create motifs. Since perception and processing are the source of musical behaviors, appropriate parts of music theory should arise when their foundations are created. This article describes novel properties of perception and shows how they create familiar properties of music composition.

The goal is to implement everything in code to remove any fuzzy or subjective parts. This is worthwhile because the theory specifies how humans decide whether music is good. With this evaluation function, running an optimization algorithm is equivalent to how humans compose music: trying revisions and checking the quality of the result.

The competing approach is to model musical behaviors directly, rather than trying to derive them from perception. In practice, modeling perception uses fewer rules to express the same behaviors, and its mechanistic explanations help prevent corner cases. For example, a large and incomplete list of chord progressions is replaced by a calculation of harmonic consonance. The major downside is that modeling perception is difficult: it's not enough to find behaviors in music; one must also find their causes.

An advantage of working with perception is that its theories are scientific rather than artistic. It makes falsifiable predictions, which enable simple and replicable experiments. The learnings from these experiments are helpful in general, not tailored to a single generator. Fixing one issue won't knock over other hidden causal variables, since the foundations are modeled, not surface correlations. So there are no worries about treading water.

The generator is missing some theories, so it doesn't exhibit their corresponding behaviors. Pitch and harmony are 80% implemented. Memory is 1/2 implemented. The evaluation function has 3/4 theory done, but is 20% implemented. Connections have complete theory and are 70% implemented, but are not turned on. Rhythm, tension, and timbre have minimal progress. Missing areas are either generated randomly (like rhythm), or stubbed with poor approximations.

Here are the core theories:

  1. Roughness and consonance are the source of harmony. Using these calculations frees us from the twelve-tone scale.
  2. A note "makes sense" when the listener hears a set of logical rules that he values. For example, "sounds consonant" is a rule. When phrases are retrieved from memory, past rules change which rules are valued in the present, which creates variation.
  3. Auditory memory specifies which phrases can be retrieved, hence controls short-term and long-term structure. Connections and expectations describe whether phrases are perceived as similar or unrelated. Similar phrases are retrieved from memory.
  4. Humans judge music as good if it simultaneously "makes sense" and is new to them. It's like a demonstration of unexpected structure. This forms the evaluation function, and all factors are calculable.

Other important theories are tension, meter, timbre, pitch memory, and auditory scene analysis.

I strove for conciseness. Sections: Humans and computers; Memory and structure; Language of music; What makes music good; Tension; Rhythm; Harmony (roughness, lattice tones, prime ratios); Streams; Future work.

Humans and computers

Human composers grow their abilities by two methods.

The first method is to form intuitive estimates of the hidden rules of music, such as in artificial grammar learning. Listening and experimentation create improvement through experience, but this improvement is not transferable to others. Often, this inexpressibility creates a sense of mystery. (For example, experts on swing could perform it but gave nonsense answers when asked to define it. Their assertions that swing is magical and ineffable contradict our understanding of swing today.)

The second method is to guess the hidden rules explicitly, then write these guesses down concretely and clearly. This is much harder, but has an important advantage over intuition: these rules can be conveyed, so they can be improved and dissected by many people, as part of a scientific process.

Musicians use both methods, in theory and practice: learning music theory, and practicing to build intuition.

Computer music generation nowadays focuses on machine learning. Riffusion was a quantum leap in quality and method. So music generation has shifted toward the first method, intuition.

Meanwhile, auditory scientists follow the second method. They experiment on human audio perception and write down their theories and results. They are generally uninvolved with generating music. There's not much mixing between the auditory scientists, who analyze humans, and the computer music researchers, who analyze compositions.

Our generator follows the auditory scientists. Its rules for composition arise indirectly from human audio perception. This is in contrast to other computer music researchers, who derive such rules by hand or by AI, by looking at prior compositions.

Memory and structure

Auditory memory is not random-access; retrieval can only start from specific notes. You can try these experiments now to see quirks of your long-term auditory memory:

  1. Memory moves forward, not backward. Consider the lyrics of a song you are familiar with. You are able to recite them. But it's hard to recite the lines in reverse order: last line, then second-to-last line, etc. Similarly, it's hard to recite the words in reverse order. If you try, you will find that you are picking random verses, reciting forward until you reach the previous stopping point, then reversing. This transfers those lines from long-term memory into short-term memory.
  2. You can skip to the next line, and you can recall the beginning of the current line. You can recite the first word of each line. At any word, you can recall the first word of the next line and the first word of the current line.
  3. You can't skip to other positions. To recite every other word, you must retrieve every word and drop the words you don't need; you can't skip to the next next word without retrieving the word in the middle. The same applies if you try to retrieve the first word of every other line: you must retrieve the first word of every line and drop half the words. You also can't retrieve the second word from the next line, unless you retrieve the first word first.
  4. Random positions are hard to retrieve. If you try to list random lyrics, you can only recall a few. In addition, each lyric you find will start at the beginning of a line.
  5. Retrieval can be prompted. If you are given a line from a song, you can name the song and retrieve the next line. This works even if the given words start in the middle of a line, but it's slower. So even though you can't recall arbitrary memory locations on your own, they are still readily accessible when given externally.

Memory for pitch and speech are separate, but they share these properties. Metric stresses and phrase beginnings correspond to line beginnings, and notes correspond to words. (You can try the same tests on an instrumental song you're familiar with.)
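
To make these access patterns concrete, here is a minimal sketch (in Python, not the generator's actual code) of long-term auditory memory as a forward-only structure indexed by line starts. The class and method names are mine, chosen only for illustration.

```python
class PhraseMemory:
    def __init__(self, lines):
        # Each line is a list of words (or notes); only line starts are indexed.
        self.lines = lines

    def recall_forward(self, line_index, start_word=0):
        """Yield elements forward from a line start (or from a mid-line word, if
        prompted externally). There is no corresponding backward iterator."""
        for line in self.lines[line_index:]:
            for word in line[start_word:]:
                yield word
            start_word = 0  # later lines always start from their beginning

    def skip_to_next_line(self, line_index):
        # Jumping to the next line start is allowed; jumping to "every other word"
        # is not -- you would have to walk forward and discard.
        return self.recall_forward(line_index + 1)


memory = PhraseMemory([["twinkle", "twinkle", "little", "star"],
                       ["how", "I", "wonder", "what", "you", "are"]])
print(list(memory.recall_forward(0)))        # forward recitation works
print(list(memory.skip_to_next_line(0)))     # so does skipping to the next line start
```

Reverse recitation would require materializing a forward pass first, which matches experiment 1 above.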

All these rules are only for long-term auditory memory. To analyze short-term auditory memory, you can listen to an instrumental-only song (such as this one). Listening will reveal that:

  1. You can recall all the most recent notes. With difficulty, you can replay those notes backwards, which wasn't possible with long-term memory.
  2. You can retrieve phrases from a short while ago, but only starting from the beginning of the phrase or a stress, not from the middle.

Ok, you can turn the song off now.

Music structure is caused by relationships between phrases. Spotting a relationship means finding a phrase in the past that is similar in some way to a phrase now, then determining the difference. To find this relationship:

  1. The old phrase must be in memory. This means the list of memory limitations also controls which phrases can be retrieved. Short term and long term memory have different limits, which influence short term and long term structure.
  2. Out of a pool of many similar and dissimilar phrases, you must be able to find which phrase is most similar, without using attention! Your attention is only able to focus on one phrase at a time, so it can't check all the phrases one-by-one for similarity. And we already know that long-term memory is unable to retrieve arbitrary positions.

Your mind has a special mechanism to retrieve a fuzzy match from memory, without using attention. It tests phrases in parallel, so it can search through a large database (more precisely, modern Feature Integration Theory says the speed is logarithmic). You may not even detect the difference between the found phrase and the template phrase. Since this process uses no conscious thought, it must be simple. We'll discuss the distance function of this fuzzy match in the next section.
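
As a rough illustration of attention-free retrieval, the sketch below (my own toy example, not the generator's code) compares a template phrase against every stored phrase at once with vectorized arithmetic, rather than checking candidates one by one. The features and weights are placeholders.

```python
import numpy as np

def best_match(template, candidates, weights):
    """template: (notes, features) array; candidates: (phrases, notes, features) array.
    All stored phrases are compared in parallel, not serially by attention."""
    diffs = candidates - template[None, :, :]
    dists = np.sqrt(((diffs * weights) ** 2).sum(axis=(1, 2)))
    return int(np.argmin(dists)), dists

template = np.array([[60, 0.5], [62, 0.5], [64, 1.0]])   # (pitch, duration) per note
stored = np.stack([template + np.random.randn(3, 2) * s for s in (0.2, 2.0, 5.0)])
index, distances = best_match(template, stored, weights=np.array([1.0, 0.3]))
print(index, distances)   # the least-perturbed phrase wins
```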

Memory decays at different rates for different features. For example, timbre and loudness are recorded in short term memory, but are poorly copied into long term memory. Pitch and rhythm are better remembered.

At the bottom of each section is a button listing the relevant prior research and inspirations, like this one:

There's some research on auditory memory (Zimmermann et al. 2016; Schulze et al. 2018), but it doesn't align with topics useful for computer music generation. Researchers found the separation of auditory memory into sensory memory (notes heard now), short-term memory, and long-term memory. Baddeley characterized the phonological loop. Deutsch and others separated memories for pitch and non-tonal speech, and performed experiments on pitch memory, although these experiments are not used here.

Language of music

Given two phrases, you can compare them and determine if they are related or not. This means there is a distance function between phrases.

A note is formed of features such as pitch, brightness, and loudness. Each feature is conceptually "one property only". Behavior is similar to Feature Integration Theory (Wolfe 2020). Features are perceived independently, then combined; illusory conjunctions support this explanation.

A connection is a relationship between notes held simultaneously in short term memory. (Notes from long term memory can be retrieved into short term memory, and hence become available for connections.) Connections only matter if your mind recognizes them. Here are some example connections:

A connection is decorated with two properties:

Memory recall pairs up notes in the past to notes in the present. This recall is represented by a ladder of association, where each rung pairs two notes. The left of the ladder represents notes in the past, the right of the ladder represents notes in the present, and the ladder says "these notes are associated to each other". By the previous section, memory is recalled sequentially. It is the same for ladders, which grow one rung at a time as notes are sequentially recalled and associated to the present. Each ladder corresponds to one timepoint being recalled, travelling forward in time, and multiple ladders can be active simultaneously. Notes can be omitted from a ladder, but each side can only have one copy of each note.

The effect of a ladder is that connections on the left side create expectations for similar connections to appear on the right side. The strength of the expectation is increased by successful expectations of the same type. For example, expectations in pitch do not affect expectations in timbre, whether they are successful or not.

The strength of the ladder is how willing the mind is to continue retrieving notes from memory. It is equal to the sum of the strengths of its expectations. When this strength falls to 0, the retrieval stops and the ladder terminates. Ladders have an initial setup cost equal to the difficulty of retrieval from memory. For example, if two phrases are 50 notes apart, it is much harder to retrieve the first phrase than if they are 4 notes apart. The ladder will not be recognized until enough expectations have been fulfilled.
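
A minimal data-structure sketch of a ladder, under the assumptions in this section (rungs pair past and present notes, fulfilled expectations add strength, and a setup cost grows with retrieval distance). The class, fields, and constants are illustrative, not taken from the generator.

```python
class Ladder:
    """One recalled timepoint travelling forward in time, paired note-by-note
    with the present. Multiple ladders can be active simultaneously."""

    def __init__(self, memory_distance_in_notes):
        self.rungs = []                                   # (past_note, present_note) pairs
        self.strength = 0.0                               # sum of expectation strengths
        self.setup_cost = 0.1 * memory_distance_in_notes  # retrieving far back is harder

    def add_rung(self, past_note, present_note, expectation_strengths):
        """expectation_strengths: per-feature contributions (e.g. {'pitch': 1.0}),
        positive when a connection on the left reappears on the right, negative
        when the expectation is contradicted."""
        self.rungs.append((past_note, present_note))
        self.strength += sum(expectation_strengths.values())

    def recognized(self):
        # The ladder is not perceived until enough expectations are fulfilled.
        return self.strength > self.setup_cost

    def should_continue(self):
        # Retrieval stops once the willingness to keep retrieving falls to zero.
        return self.strength > 0.0
```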

A connection's order is determined by whether expectations support or contradict it. For example, in this phrase of 8 notes, the default expectation is that the first 4 notes are repeated by the second 4 notes. The last note is not an exact repeat, but its perceived order is still higher than if the first 4 notes did not exist at all. Hence, the expectation created by the first 4 notes boosts the order of the last note's pitch.

Some notes are easier to recall from memory: phrase beginnings, stresses, and beginnings of previous ladders. Observationally, when composers copy prior snippets from inside a work, the copies usually start at these notes. This accords with theory, which predicts that retrieval from these notes is more efficient, hence ladders are easier to build. Likewise, composers frequently terminate the copy at a phrase ending, which is where connections either cease or become ambiguous.

Connections and ladders explain many behaviors of music structure in a parsimonious way:

While connections and ladders have structure that cause the above behaviors, they do not predict why these behaviors should be preferred. All they predict is what is perceived as "ordered" and which connections are considered important. Just because a note is ordered does not mean it is good. The final piece of this puzzle is in the next section.

Note that connections are evaluated within one side of a ladder, not across the ladder, which may transform their perception in nonobvious ways. For example, if we merge phrases 10 3 9 4 8 5 7 6 and 6 7 5 8 4 9 3 10 into 10-6 3-7 9-5 4-8 8-4 5-9 7-3 6-10, the pitch contours of the components are not recognizable in the merged phrase. This is because stream segregation forces all the top pitches to connect to each other, so you cannot perceive the connections between alternating high-low notes.

The initial cost to set up a ladder is roughly the feature distance between the first two notes on each side, plus a memory cost. It's possible that features without labels, like timbre, are poorly encoded in long term memory. Rhythm is a special feature; it forms an index by which the note is retrieved, so it is bound to all other properties of the note, and is not an independent feature like pitch or loudness. In a later section, rhythm is described; a note is indexed by its position within nested sets of metric units, and time offset from that metrical position. The note's "handle", in computing terms, is not its time, but its rhythmic position, as an iterator into a skip-list-like structure. The strongest connections for phrase retrieval are rhythm, timbre, pitch, and harmony.
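
Here is a small sketch of that "handle" idea in code (the names and fields are mine, chosen only to illustrate the indexing): a note is keyed by its position inside nested metric units plus a time offset, while the other features stay independent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPosition:
    path: tuple      # index within each nested metric unit, outermost first, e.g. (bar, beat, subdivision)
    offset: float    # small time offset from that grid point (swing, rubato)

@dataclass
class Note:
    position: MetricPosition  # the retrieval key: rhythm is bound to everything else
    pitch: float              # independent features follow
    loudness: float
    timbre_id: int

# Two notes at "the same place" in consecutive bars differ only in the outermost
# index of the key, which is what makes stress-aligned forward retrieval cheap.
n1 = Note(MetricPosition((0, 2, 0), 0.0), pitch=440.0, loudness=0.6, timbre_id=3)
n2 = Note(MetricPosition((1, 2, 0), 0.0), pitch=440.0, loudness=0.6, timbre_id=3)
print(n1.position.path[1:] == n2.position.path[1:])   # True
```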

Applying feature theories (Shinn-Cunningham 2008) to audio (Spence and Frings 2020) is relatively unexplored. There are actually two separate distances. The distance used for retrieval from memory is different from the distance after attention. These are the "search template" and "target template". I have not explored the differences between these two in audio. It's also unclear how features cleave, if they are not built-in. For example, proximity and envelopment were clarified by research on reverb by Tapio Lokki's lab (Kaplanis et al. 2019) and David Griesinger. But most people do not read these papers, so they do not separate proximity from the "timbre" catch-all. Does proximity really behave as a feature? Does isolating it change its distance function, and what is its distance function pre-isolation?

What makes music good

The goal of music is to learn, forming a mental model of sound. The language of connections is a set of axioms, and music is a demonstration of the emergent structure. This is a specific type of learning, caused only by improvement of a mental model for music, like the training of a neural network. Other types of learning, such as memorizing facts, do not evoke the same emotion. Music is good when it maximizes this metric: training = [distance from what you know] * [order]. ("Training" is a poor choice of terminology, but I didn't find a better word.)

  1. Distance from what you know = how poorly it is encoded by your existing mental model. This represents what you don't already understand. For example, you have (likely) already memorized Twinkle Twinkle Little Star, so it's fully understood and encoded, and its distance is 0. Simple transforms won't affect it: if this song is heard one octave higher, you still understand it in relation to the original song, so distance is still 0. But if the song is reharmonized, there is no obvious transform to construct it, so the distance is positive. A remix may not be very surprising, so it has low distance, whereas a counterexample to a foundational belief will have high distance.
  2. Order = if it "makes sense". For connections, order = high quantity and order of unique connections. It matters that the listener personally perceives these connections and considers them highly ordered.

Training is high when distance and order are simultaneously high. This is for an idea that makes sense, but that you did not believe could exist, or that didn't fit what you already knew. Another way to see it is as an interesting surprise. (This is also how people judge math theorems.)

Since learning is the goal, it is natural that the metric should reward updating the mental model. However, difficulty in encoding the surprise in the model does not reduce perception of training (in fact, it may actually increase perception of training, as it makes the idea seem mysterious instead of simple). An example is a stage magician who shows off a cool trick that the audience cannot figure out. The audience's model update is only that it is possible, not how it happens, but the trick is still cool. The distance to the model is sufficient; actually learning the mechanics is optional. This means hearing good music does not guarantee the mental model is making progress.

Music preferences changing over time is a sign of personal growth. As your model improves, you comprehend old songs (thus making old songs boring) and perceive more connections (thus opening up new songs). One example is harmony; people initially recognize no intervals, even octaves (Jacoby et al. 2019). As they listen to music, they start to hear common intervals as harmonic, so music relying on these intervals becomes accessible and interesting. Jazz listeners assign order to even more intervals, which lets them enjoy music with those intervals. (Theory does not say whether assigning order to those intervals is good or bad.)

Many works have structure that listeners cannot hear. For example, Bach's Goldberg Canons follow simple rules of retrograde, inversion, and time shifting. But these rules are not discovered during listening unless the listener is told to look for them. Thus, these rules do not contribute to order. A common sign of incomplete comprehension is repetition, either within a piece, or by re-listening to the same piece. Fuzzy matching and memory are limited, which forms obstacles to spotting connections.

For instrumental music, there are four major sources of training.

The first source is timbre. Perception of quality in the first 10 seconds of a song is driven mostly by timbre. Listeners learn both individual timbres and the paths between timbres. For example, a lowpass filter applied to a violin may not be interesting. But if the filter's cutoff frequency changes over time, it shows a smooth path between different timbres that the listener may not have known. Examples:

The second source is limits and ambiguities of mental processing. This is when the language of music fails to characterize music properly, but expectations are still met. These deviations from structure create moments of discovery. Examples:

The third source is ties to external traits. This category is out of scope for this article, since it involves complicated problems like language modeling. Examples:

The last source is connections. Examples of training from connections:

The naive way to generate music is to create connections of varying complexity. High complexity connections are likely to be new, low complexity connections are ordered, and sometimes these connections link together to be interesting. For now, this is what the generator does, since implementation of training is unfinished.

However, connections are not independent. Pitch affects stream segregation, and pitch contour and analog pitch interact. Timing and rhythm interact with each other, and also affect information delivery and lyrics. Timbre affects chordalness, pitch, and brightness. Most features affect tension. Harmony is indirectly created, as it is derived from frequency. As a consequence of these interactions, connections must be measured after creation, rather than predicted in advance.

The better way to generate music is as an optimization problem with training as the metric. Some simplifying assumptions are still necessary. The listener's model only contains material he has previously heard from the same piece (or past pieces the algorithm knows about); these share ladders with the expectations. The distance function specifies the obvious transformations as modifications of independent features. Comprehension is influenced by memory limits, complexity of the connections, attention, and time (slower notes are easier to process). Order is calculated as in the previous section. Order and distance must happen together in a single feature. For example, if the pitches are random and hence novel, and the rhythm is highly ordered, there is still no training, because the novelty and order are in different features. Tying training to individual features is valid to the extent that features are independently modifiable, and that the listener understands this independence. This handles training from connections.
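
The sketch below shows the shape of that optimization loop with training as the objective. The scoring functions are crude stand-ins I wrote for illustration; the real distance, order, and per-feature bookkeeping described above are far more involved.

```python
import random

def distance_from_model(piece, known_phrases):
    """Stub: fraction of the piece's notes not already in the listener's model."""
    known = set(n for phrase in known_phrases for n in phrase)
    return sum(n not in known for n in piece) / max(len(piece), 1)

def order(piece):
    """Stub: fraction of adjacent pitch steps that are small (a crude 'makes sense')."""
    steps = [abs(b - a) for a, b in zip(piece, piece[1:])]
    return sum(s <= 2 for s in steps) / max(len(steps), 1)

def training(piece, known_phrases):
    # Real scoring must tie distance and order to the same features; this global
    # product only shows the shape of the objective.
    return distance_from_model(piece, known_phrases) * order(piece)

def optimize(piece, known_phrases, steps=5000):
    best, best_score = piece, training(piece, known_phrases)
    for _ in range(steps):
        candidate = best[:]
        candidate[random.randrange(len(candidate))] += random.choice([-2, -1, 1, 2])
        score = training(candidate, known_phrases)
        if score > best_score:                      # try revisions, keep improvements
            best, best_score = candidate, score
    return best

print(optimize([60, 60, 60, 60, 60, 60, 60, 60], known_phrases=[[60, 62, 64]]))
```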

The other two sources of training, timbre and ambiguities, appear to be a collection of special cases. They obey the same principle and calculations, but the individual instances are too heterogeneous to have an elegant theory.

Learning in music generalizes to other fields in the concept of "fun", consisting of self-improvement of (mental) capabilities. (This definition requires much more specification, with 400 pages of unpublished research laying out the technicalities and boundaries. But these details have little relevance to music.) Fun is the core of fiction novels, games, art, movies, humor, and music. For example, in game design, players have fun when they perceive:

  1. their effort leads to improvement in ability
  2. that improved ability matters, leading to an improved outcome either inside or outside the game

Music is fun, so the mind believes that listening to music will increase some ability. Musicians consistently do better in some domains (Agres and Krumhansl 2008), mainly tonality and tonal memory, but real-world benefits are inconclusive (Neves et al. 2022).

In modern classical music, there is a tendency for composers to write connections that listeners cannot hear, but that people reading the score can see. As usual, connections only matter if you can perceive them - so the listeners won't enjoy the work, but the readers might. For example, Bach's Goldberg Canons are like a clever math puzzle. The rigidity in their construction can make them interesting. Maybe you care about such things, or maybe not, but math puzzles are also out of scope for this article. For our definition of training to be valid here, we would need to replace order with value, and then you would specify whether you value the order in these puzzles.
The Rite of Spring is another example of people perceiving order differently. Its rhythms and harmony are very high distance, since they're in extreme contrast to prior works. If you consider freeform rhythm and high dissonance to "make sense", then it's a great work, and otherwise it's a meaningless work. This is why reactions are polarized into acclaim and disapproval. As with jazz harmonies, both perspectives are justified.

From Boden (1998), "A creative idea is one which is novel, surprising, and valuable (interesting, useful, beautiful...)." That is a good definition. In music, distance measures what is "surprising", and order measures what is "valuable". I did not find any divergence between value and order in music. Since order is easier to understand, I used order. But "value" is the better concept when generalizing to other fields.
Boden's core idea is also valid for fun, but requires filtering through psychology. For example, leveling up and finding equipment in an RPG are "fun", although they are neither "surprising" nor "valuable". And listening to a lecture about paradoxes of infinity may not be fun, even if it is on your next exam and hence both surprising and valuable. But these violations are only definitional, and the core idea still works.
Benign Violation Theory (2010) is a similar theory in humor, whose word "violation" is insightful. BVT would be better if it considered value. The result of a 20-sided dice roll is both benign and surprising (5% chance to predict correctly), but it has zero value and is hence not humorous. Some things also cause awe rather than humor.

Tension

Tension is the expectation of imminent future training, where training is as defined in the previous section. The outcome of tension is a focusing of attention and processing.

If there is no expectation of training, there is no tension. Random notes are unordered and cause dissonance, but no tension. The hum of a power line is ordered and causes boredom, but no tension.

Tension is only an expectation of training, not a guarantee. If the training appears as expected, it causes a resolution. If the training fails to appear, the tension will continue or disappear unsatisfactorily.

Observationally, music sections often end in a resolution, rather than tension. Theory predicts that this is efficient: if ideas were appended to the end without the corresponding connections, those ideas could be removed without decreasing training (this ignores mood-setting music, which uses tension for a different purpose). Also, since the resolution has strong connections to prior notes, a separation between the resolution and its connected notes would reduce its power. This feature of tension and resolution is a result of process, not a fundamental property.

Here are some sources of tension:

This link has audio examples of tension.

There are interesting ways to violate tension, using the third source of training, limits and ambiguities of processing. For example, a constantly increasing speed appears unsustainable, but a Risset rhythm can continue forever.

Tension theory is still imprecise. Separating cause and effect is necessary for these co-occurring behaviors: expectation of imminent training, focusing of processing/training, and expectation of change.

Composers have been successful at listing surface features of tension, such as dissonance and repetition, and at constructing and recognizing examples of tension. But providing an accurate definition has been elusive. Some researchers, such as Huron, have realized that tension is "expectation" or "anticipation" of some kind, but their further specifications are not illuminating.

Rhythm

Rhythm is decided by the beginnings of notes, and note endings are unimportant. These note beginnings are placed into structures we will call "metric units". There are three main operations listeners can hear:

Division automatically causes merging. If the division is not a mergeable number, then no metric unit is created.

For example, Take 5, with a 5/4 meter, merges 3 notes and 2 notes, then merges again to form 5. There is no merging to 5 directly.
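
A tiny sketch of that nesting, assuming (as the Take 5 example suggests) that only counts of 2 and 3 merge directly; the representation and the MERGEABLE set are my own illustration, not a claim from the theory.

```python
MERGEABLE = {2, 3}   # assumption drawn from the Take 5 example: no direct merge to 5

def metric_unit(counts):
    """counts: e.g. [3, 2] -> a unit of 5 built from a 3-unit and a 2-unit."""
    if not all(c in MERGEABLE for c in counts) or len(counts) not in MERGEABLE:
        raise ValueError("sub-units and their count must both be mergeable")
    return {"beats": sum(counts), "children": [{"beats": c} for c in counts]}

take5_bar = metric_unit([3, 2])   # 5/4 heard as 3+2, never as a direct 5
print(take5_bar)
```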

In a metric unit, the first timepoint has more perceptual importance than the others, and is distinguished in memory. It may not have a stress; stresses are simply tools to demonstrate meter, just like bass and snare drums, or phrase repetitions.

It is unspecified which metric units can be repeated or merged, and what the consequence is; this is a very large absence in the theory. There are also swing notes, which happen before or after a metric timepoint while being associated to it. If the note is before a metric stress, and no other note takes that stress, then the first note acquires the stress.

As a consequence of rhythmic organization, for a retrograde phrase to be recognized, note stresses should be preserved. The typical retrograde transformation of flipping note durations back-to-front does not create a connection that listeners can hear.

This section has no new discoveries; it's all from the literature, and I have done little testing. Yust's book is likely a better source, although I have not yet read it.

Harmony: roughness

Computer and human composers nowadays rely on the twelve-tone scale. Harmonic theory frees us from this limitation.

Helmholtz reported that dissonance is caused by close-by frequencies beating unpleasantly, called "roughness". Frequencies beat more unpleasantly when farther apart, but interact more strongly when close together. These opposing effects multiply, so roughness is greatest when frequencies are separated by a short distance. Also, roughness is a property of individual frequencies, and roughness of the holistic sound is only a consequence.

The roughness of a single frequency A is a weighted average of its roughnesses with other frequencies B. Let B be a sine wave that is active when A is played (B can be A itself). Let E_B be the loudness of B.
scale = 19 + 0.021*A (the scale term is suggestive of critical bandwidth, but the coefficients are very different)
x = (f_high - f_low) / scale
interaction = e^(-0.84x)
divergence = 1 - e^(-x)
roughness(A) = (∑_B E_B * interaction * divergence) / (∑_B E_B * interaction)

The roughness of a chord is the loudness-weighted average of the roughness of its frequencies. This contrasts with previous theories, which summed roughness rather than averaged it. Sums fit very poorly as the number of frequencies increases. I have not specified the units of loudness because I do not know the correct nonlinear transformation. My current approximation is loudness = energy adjusted by the equal-loudness contour. Perceived roughness scales faster than amplitude^2, so this approximation is known to be deficient. There are possible fixes (one fix: in the next section, nonlinearities can be inserted for lattice tones; but even perceived beating of true AM tones is nonlinear). However, since the flawed formula provides a simplification (amplitude-invariance of dissonance), and its failure mode is safe, I have not explored improvements for now.
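
A direct transcription of the formulas above into code, using loudness = energy with the caveats just mentioned; the function names and example frequencies are mine.

```python
import math

def roughness_of_frequency(fA, partials):
    """partials: list of (frequency, loudness) sine components active while A sounds,
    including A itself."""
    num = den = 0.0
    for fB, loudB in partials:
        scale = 19 + 0.021 * fA
        x = abs(fA - fB) / scale
        interaction = math.exp(-0.84 * x)
        divergence = 1 - math.exp(-x)
        num += loudB * interaction * divergence
        den += loudB * interaction
    return num / den if den else 0.0

def roughness_of_chord(partials):
    # Loudness-weighted average over the chord's frequencies (not a sum).
    total = sum(loud for _, loud in partials)
    return sum(loud * roughness_of_frequency(f, partials) for f, loud in partials) / total

# A close pair beats harshly; a wide pair barely interacts.
print(roughness_of_chord([(200.0, 1.0), (215.0, 1.0)]))   # higher
print(roughness_of_chord([(200.0, 1.0), (400.0, 1.0)]))   # much lower
```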

Plomp & Levelt (1965) measured this behavior, Kameoka and Kuriyagawa (1969) refined the experiments, and Sethares (1998) fit a curve to the data. For two tones of equal loudness, Sethares's curve is more accurate than its method of construction suggests it should be. This section corrects it for multiple tones by changing the additivity and separating interaction from divergence.

Harmony: lattice tones

It is known that two frequencies near a small-integer ratio (such as 400 Hz and 803 Hz) will produce a beating effect (at 3 Hz). This is still true if the 400 Hz tone is heard only in the left ear, and 803 Hz is heard only in the right ear, as a generalization of the "binaural beats" effect.

This is caused by lattice tones. Let f1, f2, f3, ... be a set of frequencies heard simultaneously. At any integer combination of these frequencies, such as f1 + f2 - 3f3, a lattice tone will be generated. A demonstration is that with 630 Hz, 430 Hz, and 530+C Hz at loud volume, beats of 2C Hz are heard. Lattice tones matter because they are a major source of roughness. Lattice tones are not audible on their own; beating only appears when there is a nearby frequency.

The strength of a lattice tone is larger when it is close to its generating frequencies and the sum of the coefficients in its integer combination is small. My approximation for the first factor is a function of (generating frequency) / (lattice tone frequency) - 1, while my approximation for the second factor is a multiplicative 0.7 per term in the sum. These approximations are not accurate.

The corrected formula for roughness(A) in the previous section is to sum over lattice tones B, rather than just frequencies B. Each frequency generates a lattice tone equal to itself, so this is a clean generalization.
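 
A sketch of how that generalization might be computed. The closeness falloff, the coefficient bound, and the neglect of loudness are all placeholders of my own; only the 0.7-per-term factor and the "sum over lattice tones" structure come from the text. The resulting (frequency, strength) list could then be fed into the roughness formula of the previous section in place of the raw partials.

```python
import itertools, math

def lattice_tones(partials, max_coeff=2):
    """partials: list of (frequency, loudness). Returns (frequency, strength) pairs,
    with each original frequency included as its own lattice tone."""
    freqs = [f for f, _ in partials]
    tones = list(partials)
    for coeffs in itertools.product(range(-max_coeff, max_coeff + 1), repeat=len(freqs)):
        n_terms = sum(abs(c) for c in coeffs)
        if n_terms < 2:
            continue                                    # originals are already included
        f = sum(c * fr for c, fr in zip(coeffs, freqs))
        if f <= 0:
            continue
        nearest_gen = min(freqs, key=lambda g: abs(g / f - 1))
        closeness = math.exp(-3 * abs(nearest_gen / f - 1))  # placeholder falloff
        tones.append((f, closeness * 0.7 ** n_terms))        # 0.7 per term in the combination
    return tones

# 400+600-500 = 500 Hz lands exactly on a partial of the major triad (4:5:6),
# while 400+600-480 = 520 Hz lands off the 480 Hz partial of the minor triad (10:12:15).
major = lattice_tones([(400.0, 1.0), (500.0, 1.0), (600.0, 1.0)])
minor = lattice_tones([(400.0, 1.0), (480.0, 1.0), (600.0, 1.0)])
```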

Lattice tones provide the solution to a long-open problem. Major and minor triads are 4:5:6 and 10:12:15, which have the same pairwise ratios, so pairwise roughness calculates the same dissonance for them. However, major triads sound more consonant than minor triads, because the middle lattice tone aligns with 5 in the major triad, and misaligns with 12 in the minor triad. This causes extra roughness for the minor triad.

Experiments on lattice tones face considerable difficulties: masking by the comparison tone, interference from neighboring lattice tones, nonlinearity of beat perception, distinguishing them from combination tones, and perception being changed by consecutive trials. Threshold-of-detection experiments fail because the masking is strong and varies with volume (the comparison tone masks itself), and numerical estimates fail because of the lack of a consistent scale. This means only pairwise comparisons are possible, which must be designed carefully to maximize similarity in relevant features and lessen interference. Furthermore, since the listener must extract beating in a single tone from beating in other tones, an experienced listener is required.
Fitting the data is also very difficult, as only uncertain pairwise comparisons exist, and no symmetries or commutative operations are satisfied. For example, changing all the frequencies or volumes in concert is expected to produce a consistent change depending only on the changed variable. Unfortunately, the behavior of such a change depends strongly on all the other variables as well, so the function has no easy factorization.

Plomp's excellent 1967 paper Beats of mistuned consonances describes lattice tones with 2 frequencies. It fails to realize the extension to 3+ frequencies and the connection to roughness. This is surprising, since Plomp was the premier expert on roughness and an outstanding experimentalist. Since nobody spotted the paper's implications, it was regarded as merely a psychoacoustic curiosity, even by Plomp himself.
There is also another effect called "combination tones". Those are generally quieter than lattice tones, except at high frequencies. We will ignore them for this article, but they are occasionally loud enough to matter.

Harmony: prime ratios

Chords in integer ratios (like 2/1 or 3/2) are consonant. One cause is pairwise roughness of the upper partials. But the consonance exists even when all the upper partials are zeroed and roughness predicts no effect. For example, if an instrument's timbre is changed to a sine wave, its harmony is still mostly the same. This indicates that roughness is not the entire solution.

A demonstration is to break a chord up into an arpeggio. The chord and arpeggio have the same notes, yet the chord is dissonant and the arpeggio is consonant. Roughness and lattice tones only exist for simultaneous notes, so another mechanism is working for non-overlapping notes.

A note has dissonance and consonance depending on its ratio to other frequencies, independent of roughness. This consonance is tied to the note, not just a property of the overall sound.

This mechanism is a trained response. Jazz musicians have different responses from classical musicians, and non-music listeners have no response at all. The following formulas are measured on myself, who has an intermediate amount of classical music training.

Fractions with few and small prime factors sound consonant. Let n = 2^a * 3^b * 5^c * 7^d. Then dissonance(n) = 2.2b + 5c + 12d + log(n)^2. For the reduced fraction x/y, dissonance(x/y) = dissonance(x) + 0.5 dissonance(y). Inexact ratios round to the closest ratio, with increasing rounding causing increasing dissonance. Dissonance soft-caps around x/y = 7/9 or 7/5, past which ratios cease to matter and rounding takes over. Constants and formulas were found experimentally, and the fit is decent but not great. (One systematic bias: when y is divisible by 7, it measures too low for sequential notes. Also, this neglects the effect of pitch distance on harmony.)
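
A transcription of this formula into code; the search over candidate ratios and the rounding-penalty constant are my own simplifications for illustration, and the soft cap is not modeled.

```python
import math
from fractions import Fraction

def factor_dissonance(n):
    """dissonance(n) = 2.2b + 5c + 12d + log(n)^2 for n = 2^a * 3^b * 5^c * 7^d."""
    m, exps = n, {}
    for p in (2, 3, 5, 7):
        e = 0
        while m % p == 0:
            m //= p
            e += 1
        exps[p] = e
    if m != 1:
        return float("inf")          # other prime factors fall outside the formula
    return 2.2 * exps[3] + 5 * exps[5] + 12 * exps[7] + math.log(n) ** 2

def ratio_dissonance(x, y):
    """dissonance(x/y) for a reduced fraction x/y."""
    q = Fraction(x, y)
    return factor_dissonance(q.numerator) + 0.5 * factor_dissonance(q.denominator)

# Candidate ratios to round to: fractions of 7-smooth numbers up to 16.
SMOOTH = [n for n in range(1, 17) if factor_dissonance(n) < float("inf")]

def note_dissonance(f_note, f_context):
    """Round the frequency ratio to the nearest simple ratio, then add a rounding penalty."""
    r = f_note / f_context
    best = min((Fraction(a, b) for a in SMOOTH for b in SMOOTH),
               key=lambda q: abs(math.log(float(q) / r)))
    rounding_penalty = 40 * abs(math.log(float(best) / r))   # illustrative constant
    return ratio_dissonance(best.numerator, best.denominator) + rounding_penalty

print(note_dissonance(300.0, 200.0))   # 3/2: low dissonance
print(note_dissonance(289.0, 200.0))   # rounds to a nearby ratio and picks up a penalty
```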

The notes being compared to are in short-term memory and sensory memory. These are the "harmony context". In the previous paragraph, x is the frequency of the note, and y is the frequency of a note in the harmony context. The dissonance of the note is its average dissonance to its harmony context. Notes in the harmony context are weighted: ratios closer to 1 are weighted more, interference by pitched sounds causes lower weights, weighting follows loudness curves, and there is a nonlinear transformation to linearize the scale. A note belongs to its own harmony context.

The dissonance of a time is a weighted average of the frequencies at that time. Weighting is by loudness. There is an attention boost: new energy counts more than old energy. If you play an arpeggio and hold down each note, then the last note will have priority for harmony, even if the attack and sustain are equal.

Atonal music never has ratios with small primes, so it always caps out on dissonance. On the other hand, unpitched sounds do not participate in harmony, and have neither consonance nor dissonance.

The prime-ratio formula predicts chord progressions with good fit. The different constants for dissonance(x) and dissonance(y) explain the asymmetry in progressions, such as why a V-I cadence resolves while an I-V cadence creates tension.

This theory is not complete. Even though lattice tones justify why minor chords are more dissonant than major chords, it does not justify why minor arpeggios are more dissonant than major arpeggios. However, extending prime ratios from pairs to triads in the obvious way does not work, because only the minor arpeggio (and its octave transformations) behaves this way, not any other utonal arpeggio ("utonal" means the tones are divisors of some frequency, rather than multiples). I have no solution for this except "maybe it's caused by training".
I do not believe this formula is good, though it is reasonably accurate for my ears. It is simply a brute force fit of observed data with numbers 1-16, then a quadratic term to cap off the high end.
Differences in harmony perception between people are large, so experiments should record the subject's level of music experience.

Calculating the strength of each note in the harmony context is moderately hard. One test framework for this question is a sequence of notes A B C ... Z. Ideally, Z is consonant to one note, and dissonant to another note, so that the level of dissonance indicates how much the two notes contribute to the context. In my testing, two notes A and Z separated by time, with no notes in between, have small decay. If the notes are A B B ... B Z, A and Z are as connected as with only one B: A B Z. Unpitched tones between A and Z don't interfere (unpitched sounds are sounds without a clear single frequency, such as drums). All these properties are shared by existing models for pitch memory. In pitch memory tests, the listener is asked whether A and Z are the same pitch. This is much easier to measure (reference: Deutsch, The Psychology of Music, 2013, ch. 7 sec. IV). Future testing of qualitative behaviors may wish to test pitch memory, then relate it to harmony, rather than test harmony directly.

Many researchers have found that roughness is insufficient to characterize dissonance. Prior work on resolving the missing part is harmonic entropy. Its fit to data is poor, but its focus on a spreading function is good. If anyone wants to research this subject, future improved formulas should involve spreading functions and a just-so story about training from roughness of harmonic overtones.

Streams

When notes are heard as one voice, they belong to a stream. These factors promote aligning a note into a stream:

  1. Starting when the previous note in the stream ends (legato). If the previous note ends early (as in staccato) or overlaps in time, the note is less likely to be put in the stream.
  2. Similarity to the previous note in the stream in all features (pitch, timbre, loudness, location, etc). Tolerance for differences goes up with higher time separation between note starts.

You can only pay attention to one stream, which becomes the foreground stream. This foreground stream contains the melody and the primary rhythm. You can voluntarily switch your attention to other notes, so the foreground stream depends on the listener, not just the piece. Sometimes, attention is bistable, switching back and forth between different streams. After notes are segregated into streams, these factors promote turning a stream into the foreground stream (giving it attention):

  1. High pitch.
  2. High volume.
  3. Not being repetitive, such as in Ravel's Ondine or La Campanella. That includes repetitions of longer phrases, such as 4 notes being repeated.

Segregation-then-attention is not a clean order; attention also affects stream segregation. There may be no foreground stream, such as in atmospheric music.
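
A toy scoring function for these two stages (grouping by legato and feature similarity, then picking a foreground); the weights and functional forms are placeholders I chose, and as noted below, the real model's fit is still poor.

```python
def stream_affinity(note, last, weights={"pitch": 1.0, "timbre": 0.5, "loudness": 0.3}):
    """note/last: dicts with 'start', 'end', and feature keys. Higher score means the
    note is more likely to join the stream whose most recent note is `last`."""
    gap = note["start"] - last["end"]                # 0 for legato
    onset_sep = max(note["start"] - last["start"], 1e-3)
    legato = -abs(gap)                               # staccato gaps and overlaps both penalized
    similarity = -sum(w * abs(note[k] - last[k]) for k, w in weights.items())
    return legato + similarity / onset_sep           # tolerance loosens with onset separation

def foreground_score(stream):
    """Crude attention score: favor high pitch and loudness, penalize exact repetition."""
    pitches = [n["pitch"] for n in stream]
    repetition = sum(a == b for a, b in zip(pitches, pitches[1:]))
    return max(pitches) + 20 * max(n["loudness"] for n in stream) - 5 * repetition

high = [{"start": 0.0, "end": 0.5, "pitch": 72, "timbre": 0.2, "loudness": 0.6}]
low  = [{"start": 0.0, "end": 0.5, "pitch": 48, "timbre": 0.8, "loudness": 0.6}]
new  = {"start": 0.5, "end": 1.0, "pitch": 71, "timbre": 0.2, "loudness": 0.6}
best_stream = max([high, low], key=lambda s: stream_affinity(new, s[-1]))
best_stream.append(new)                              # joins the similar, legato stream
print(max([high, low], key=foreground_score) is high)
```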

The current fit of my stream segregation model to perception is pretty bad.

Streams are researched much further than described here. See Szabó et al. (2016).

Future work

This article leaves out a lot of details and experiments that are not generally interesting (as is the case for most research). Those details aren't proprietary, just boring.

What people care about is not a beautiful model of music cognition, but a working music generator. Results, not theory. Musicians don't read music cognition research because they don't see how it helps composition. Hopefully, this article shows that music cognition research does have a positive impact.

Human composers pick up knowledge faster and easier than it can be coded into programs, which can create an advantage in composition quality. This is why a "first principles" approach is unsuccessful in writing compared to both humans and AI, which are more effective at acquiring knowledge. I've been checking existing compositions to see if this will be an issue in music. I've typically found good and concise explanations for their ideas, implying that hidden factors that humans have are also feasibly replicable by a computer. This process is nonscientific, because without fully specifying the algorithm beforehand, these fits could be caused by subjective choice or ambiguity, or by me missing crucial factors. Still, I did not find obvious limits to using perception as the foundation of a generator.

In the short term, converting completed theories to code should improve the generator greatly.

In the long term, experiments are still needed to develop theory. Anyone can do these experiments. Memory can be affected by many features, like loudness and tempo, but the effects are unclear. How loudness affects harmony is unclear. How harmony decays in the presence of other notes is unclear. Whether tension permits a simple calculation is unclear. Behavior of non-foreground streams is unclear. When notes are played too quickly, their comprehension sharply drops with just a 20% speedup, but the boundaries of this limit are unclear.

I'm not currently supported, so if anyone can offer me a remote coding/research position, please do so! My resume is here. It doesn't have to be in music; I have plenty of other skills, including ML and programming.

Discussion: [hn link]
Chat: Matrix chatroom