UAX #44: Unicode Character Database.html 314 KB

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="en-us">
<title>UAX #44: Unicode Character Database</title>
<link rel="stylesheet" type="text/css" href="http://www.unicode.org/reports/reports-v2.css">
<style type="text/css">
th { background-color: #CCFFCC }
td.lightgray { background-color: #E4E4E4 }
</style>
</head>
<body>
<table class="header" cellspacing="0" cellpadding="0" width="100%">
<tr>
<td class="icon"><a href="http://www.unicode.org">
<img align="middle" alt="[Unicode]" border="0" src="http://www.unicode.org/webscripts/logo60s2.gif" width="34" height="33"></a>&nbsp;&nbsp;
<a class="bar" href="http://www.unicode.org/reports/">Technical Reports</a></td>
</tr>
<tr>
<td class="gray">&nbsp;</td>
</tr>
</table>
<div class="body">
<!--
<h2 class="uaxtitle"><span class="changedspan">Proposed Update</span></h2>
-->
<h2 class="uaxtitle">Unicode® Standard Annex #44</h2>
<h1>Unicode Character Database</h1>
<table class="simple" width="90%">
<tr>
<td valign="top" width="20%">Version</td>
<td valign="top">Unicode 10.0.0</td>
</tr>
<tr>
<td valign="top">Editors</td>
<td valign="top"><a href="https://plus.google.com/114199149796022210033?rel=author">Mark Davis</a> (<a href="mailto:markdavis@google.com">markdavis@google.com</a>), Laurențiu Iancu (<a href="mailto:liancu@unicode.org">liancu@unicode.org</a>)
and Ken Whistler (<a href="mailto:ken@unicode.org">ken@unicode.org</a>)</td>
</tr>
<tr>
<td valign="top">Date</td>
<td valign="top">2017-06-14</td>
</tr>
<tr>
<td valign="top">This Version</td>
<td valign="top">
<a href="http://www.unicode.org/reports/tr44/tr44-20.html">http://www.unicode.org/reports/tr44/tr44-20.html</a>
</td>
</tr>
<tr>
<td valign="top">Previous Version</td>
<td valign="top">
<a href="http://www.unicode.org/reports/tr44/tr44-18.html">http://www.unicode.org/reports/tr44/tr44-18.html</a>
</td>
</tr>
<tr>
<td valign="top">Latest Version</td>
<td valign="top"><a href="http://www.unicode.org/reports/tr44/">http://www.unicode.org/reports/tr44/</a></td>
</tr>
<tr>
<td valign="top">Latest Proposed Update</td>
<td valign="top"><a href="http://www.unicode.org/reports/tr44/proposed.html">http://www.unicode.org/reports/tr44/proposed.html</a></td>
</tr>
<tr>
<td valign="top">Revision</td>
<td valign="top"><a href="#Modifications">20</a></td>
</tr>
</table>
<h4 class="summary">Summary</h4>
<blockquote>
<p><i>This annex provides the core documentation for the
Unicode Character Database (UCD). It describes the layout and organization of the Unicode
Character Database and how it specifies the formal definitions of the Unicode Character Properties.</i></p>
</blockquote>
<h4 class="status">Status</h4>
<!-- NOT YET APPROVED
<p><i><span class="changed">This is a<b><font color="#ff3333"> draft </font></b>document which
may be updated, replaced, or superseded by other documents at any time.
Publication does not imply endorsement by the Unicode Consortium. This is
not a stable document; it is inappropriate to cite this document as other
than a work in progress.</span></i></p>
END NOT YET APPROVED -->
<!-- APPROVED -->
<p><i>This document has been reviewed by Unicode members and other interested
parties, and has been approved for publication by the Unicode Consortium.
This is a stable document and may be used as reference material or cited as
a normative reference by other specifications.</i></p>
<!-- END APPROVED -->
<blockquote>
<p><i><b>A Unicode Standard Annex (UAX)</b> forms an integral part of the
Unicode Standard, but is published online as a separate document. The
Unicode Standard may require conformance to normative content in a Unicode
Standard Annex, if so specified in the Conformance chapter of that version
of the Unicode Standard. The version number of a UAX document corresponds to
the version of the Unicode Standard of which it forms a part.</i></p>
</blockquote>
<p><i>Please submit corrigenda and other comments with the online reporting
form [<a href="http://www.unicode.org/reporting.html">Feedback</a>].
Related information that is useful in understanding this annex is found in Unicode Standard Annex #41,
“<a href="http://www.unicode.org/reports/tr41/tr41-21.html">Common References for Unicode Standard Annexes</a>.”
For the latest version of the Unicode Standard, see [<a href="http://www.unicode.org/versions/latest/">Unicode</a>].
For a list of current Unicode Technical Reports, see [<a href="http://www.unicode.org/reports/">Reports</a>].
For more information about versions of the Unicode Standard, see [<a href="http://www.unicode.org/versions/">Versions</a>].
For any errata which may apply to this annex, see [<a href="http://www.unicode.org/errata/">Errata</a>].</i></p>
<h4 class="contents">Contents</h4>
<ul class="toc">
<li>1 <a href="#Introduction">Introduction</a></li>
<li>2 <a href="#Conformance">Conformance</a>
<ul class="toc">
<li>2.1 <a href="#Simple_Derived">Simple and Derived Properties</a></li>
<li>2.2 <a href="#Use_Default">Use of Default Values</a></li>
<li>2.3 <a href="#Release_Stability">Stability of Releases</a></li>
</ul></li>
<li>3 <a href="#Documentation_Files">Documentation</a>
<ul class="toc">
<li>3.1 <a href="#Character_Properties">Character Properties in the Standard</a></li>
<li>3.2 <a href="#Property_Model">The Character Property Model</a></li>
<li>3.3 <a href="#NamesList">NamesList.html</a></li>
<li>3.4 <a href="#StandardizedVariants">StandardizedVariants.html</a></li>
<li>3.5 <a href="#EmojiVariants">Emoji Variation Sequences</a></li>
<li>3.6 <a href="#Unihan">Unihan and UAX #38</a></li>
<li>3.7 <a href="#USource">UTC-Source Ideographs and UAX #45</a></li>
<li>3.8 <a href="#Data_File_Comments">Data File Comments</a></li>
<li>3.9 <a href="#Obsolete">Obsolete Documentation Files</a></li>
</ul></li>
<li>4 <a href="#UCD_Files">UCD Files</a>
<ul class="toc">
<li>4.1 <a href="#Directory_Structure">Directory Structure</a></li>
<li>4.2 <a href="#Format_Conventions">File Format Conventions</a></li>
<li>4.3 <a href="#File_List">File List</a></li>
<li>4.4 <a href="#Zipped_Files">Zipped Files</a></li>
<li>4.5 <a href="#UCD_in_XML">UCD in XML</a></li>
</ul></li>
<li>5 <a href="#Properties">Properties</a>
<ul class="toc">
<li>5.1 <a href="#Property_Index">Property Index</a></li>
<li>5.2 <a href="#About_Property_Table">About the Property Table</a></li>
<li>5.3 <a href="#Property_Definitions">Property Definitions</a></li>
<li>5.4 <a href="#Derived_Extracted">Derived Extracted Properties</a></li>
<li>5.5 <a href="#Contributory_Properties">Contributory Properties</a></li>
<li>5.6 <a href="#Casemapping">Case and Case Mapping</a></li>
<li>5.7 <a href="#Property_Values">Property Value Lists</a></li>
<li>5.8 <a href="#Property_And_Value_Aliases">Property and Property Value Aliases</a></li>
<li>5.9 <a href="#Matching_Rules">Matching Rules</a></li>
<li>5.10 <a href="#Invariants">Invariants</a></li>
<li>5.11 <a href="#Validation">Validation</a></li>
<li>5.12 <a href="#Deprecation">Deprecation</a></li>
<li>5.13 <a href="#Property_APIs">Property APIs</a></li>
<li>5.14 <a href="#Character_Age">Character Age</a></li>
</ul></li>
<li>6 <a href="#Test_Files">Test Files</a>
<ul class="toc">
<li>6.1 <a href="#NormalizationTest_txt">NormalizationTest.txt</a></li>
  155. <li>6.2 <a href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a></li>
  156. <li>6.3 <a href="#BidiTest_txt">Bidirectional Test Files</a></li>
  157. </ul></li>
  158. <li>7 <a href="#Change_History">UCD Change History</a></li>
  159. <li><a href="#Acknowledgments">Acknowledgments</a></li>
  160. <li><a href="#References">References</a></li>
  161. <li><a href="#Modifications">Modifications</a></li>
  162. </ul>
  163. <hr>
  164. <blockquote>
  165. <p><i><b>Note:</b> the information in
  166. this annex is not intended as an exhaustive description of the use and
  167. interpretation of Unicode character properties and behavior. It must be used in conjunction with
  168. the data in the other files in the Unicode Character Database, and relies on the notation and
  169. definitions supplied in <a href="http://www.unicode.org/standard/standard.html">The Unicode
  170. Standard</a>. All chapter references are to Version
  171. 10.0.0 of the standard unless otherwise indicated.</i></p>
  172. </blockquote>
  173. <h2>1 <a name="Introduction" href="#Introduction">Introduction</a></h2>
  174. <p>The Unicode Standard is far more than a simple encoding of characters.
  175. The standard also associates a rich set of semantics with each encoded
  176. character&#x2014;properties that
  177. are required for interoperability and correct behavior in
  178. implementations, as well as for Unicode conformance.
  179. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files
  180. which contain the Unicode character code points and character names.
  181. The data files define the Unicode character properties and mappings between
  182. Unicode characters (such as case mappings).</p>
  183. <p>This annex describes the UCD and provides a guide to the various
  184. documentation files associated with it. Additional information
  185. about character properties and their use is contained in the
  186. Unicode Standard and its annexes. In particular, implementers should familiarize themselves
  187. with the formal definitions and conformance requirements for properties detailed
  188. in <i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  189. and with the material in <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  190. <p>The latest version of the UCD is always located on the Unicode
  191. website at:</p>
  192. <blockquote>
  193. <a href="http://www.unicode.org/Public/UCD/latest/">http://www.unicode.org/Public/UCD/latest/</a>
  194. </blockquote>
  195. <p>The specific files for the UCD associated with this version of
  196. the Unicode Standard (10.0.0) are located at:</p>
  197. <blockquote>
  198. <a href="http://www.unicode.org/Public/10.0.0/">http://www.unicode.org/Public/10.0.0/</a>
  199. </blockquote>
  200. <p>Stable, archived versions of the UCD associated with all earlier
  201. versions of the Unicode Standard can be accessed from: </p>
  202. <blockquote>
  203. <a href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/</a>
  204. </blockquote>
  205. <p>For a description of the changes in the UCD for
  206. this version and earlier versions, see the
  207. <a href="#Change_History">UCD Change History</a>.</p>
  208. <h2>2 <a name="Conformance" href="#Conformance">Conformance</a></h2>
  209. <p>The Unicode Character Database is an integral part of the Unicode Standard.</p>
  210. <p>The UCD contains normative property and mapping information required for
  211. implementation of various Unicode algorithms such as the Unicode Bidirectional
  212. Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also
  213. contain additional informative and provisional character property information.</p>
  214. <p>Each specification of a Unicode algorithm, whether specified in the text of
  215. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] or in one of the Unicode
  216. Standard Annexes, designates which data file(s) in the UCD are needed to
  217. provide normative property information required by that algorithm.</p>
  218. <p>For information on the meaning and application of the terms
  219. <i>normative</i>, <i>informative</i>, and <i>provisional</i>, see <i>Section 3.5,
  220. Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  221. <p>For information about the applicable terms of use for the
  222. UCD, see the Unicode <a href="http://www.unicode.org/copyright.html">Terms of Use</a>.</p>
  223. <h3>2.1 <a name="Simple_Derived" href="#Simple_Derived">Simple and Derived Properties</a></h3>
  224. <h4>2.1.1 <a name="Simple_Props" href="#Simple_Props">Simple Properties</a></h4>
  225. <p>Some character properties in the UCD are simple properties.
  226. This status has no bearing on whether or not the properties are
  227. normative, but merely indicates that their values
  228. are not derived from some combination of other properties.</p>
  229. <h4>2.1.2 <a name="Derived_Props" href="#Derived_Props">Derived Properties</a></h4>
  230. <p>Other character properties are derived. This means that
  231. their values are derived by rule from some other
  232. combination of properties. Generally such rules are
  233. stated as set operations, and may or may not include
  234. explicit exception lists for individual characters.</p>
  235. <p>Certain simple properties are defined merely
  236. to make the statement of the rule defining a derived
  237. property more compact or general. Such properties are
  238. known as <a href="#Contributory_Properties">contributory properties</a>.
  239. Sometimes these contributory properties are defined to
  240. encapsulate the messiness inherent in exception
  241. lists. At other times, a contributory property may
  242. be defined to help stabilize the definition of
  243. an important derived property which is subject to stability
  244. guarantees.</p>
  245. <p>Derived character properties are not considered
  246. second-class citizens among Unicode character properties.
  247. They are defined to make implementation of important
  248. algorithms easier to state. Included among the
  249. first-class derived properties important for such
  250. implementations are: Uppercase, Lowercase, XID_Start,
  251. XID_Continue, Math, and Default_Ignorable_Code_Point, all
  252. defined in DerivedCoreProperties.txt, as well as derived
  253. properties for the optimization of normalization, defined
  254. in DerivedNormalizationProps.txt.</p>
  255. <p>Implementations should simply use the derived properties,
  256. and should not try to rederive them from lists of simple
  257. properties and collections of rules, because of the
  258. chances for error and divergence when doing so.</p>
  259. <p>Definitions of property derivations are provided
  260. for information only, typically in comment fields
  261. in the data files. Such definitions may be refactored,
  262. refined, or corrected over time. These
  263. definitions are presented in a modified set notation, expressed
  264. as set additions and/or subtractions of various other property
  265. values. For example:</p>
  266. <blockquote>
  267. <pre>
  268. # Derived Property: ID_Start
  269. # Characters that can start an identifier.
  270. # Generated from:
  271. # Lu + Ll + Lt + Lm + Lo + Nl
  272. # + Other_ID_Start
  273. # - Pattern_Syntax
  274. # - Pattern_White_Space
  275. </pre>
  276. </blockquote>
  277. <p>When interpreting definitions of derived properties
  278. of this sort, keep in mind that the operations are applied in the order written, and that
  279. set addition and set subtraction cannot be freely reordered. Thus "Lo + Lm - Pattern_Syntax" defines a different set
  280. than "Lo - Pattern_Syntax + Lm". The order of property set operations
  281. stated in the definitions affects the composition of
  282. the derived set.</p>
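<p>The effect of ordering can be sketched with ordinary Python sets. The members below are
toy stand-ins chosen for illustration, not real UCD data; only the shape of the two
expressions is taken from the text above.</p>

```python
# Toy stand-ins for property sets; the members are illustrative, not UCD data.
Lo = {"a", "b"}
Lm = {"b", "c"}
Pattern_Syntax = {"b", "c"}

# "Lo + Lm - Pattern_Syntax": form the union first, then subtract.
left_to_right = (Lo | Lm) - Pattern_Syntax

# "Lo - Pattern_Syntax + Lm": subtract first, then add Lm back in.
reordered = (Lo - Pattern_Syntax) | Lm

print(sorted(left_to_right))  # ['a']
print(sorted(reordered))      # ['a', 'b', 'c']
```

<p>In the second expression the members of Lm are added after the subtraction, so they
survive even though they are also in Pattern_Syntax.</p>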
  283. <p>If the listing of a derived property in
  284. DerivedCoreProperties.txt or a similar data
  285. file in the UCD ever disagrees with the
  286. definition of that property stated as a
  287. set definition rule, the explicit
  288. listing in the data file should <i>always</i> be taken
  289. as the normative definition of the property. As described
  290. in <a href="#Release_Stability">Stability of Releases</a>, the property
  291. listing in the data files for any given version
  292. of the standard will never change for that version.</p>
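<p>One way to follow this advice is to read the explicit listing directly rather than
evaluating the rule in the comments. The following is a minimal sketch, assuming the usual
layout of a code point or range, a semicolon, the property name, and a trailing comment. The
function name <code>parse_property_lines</code> and the abbreviated sample are illustrative;
the file format itself is described under File Format Conventions.</p>

```python
def parse_property_lines(lines):
    """Collect code points per property from 'range ; Property # comment' lines."""
    members = {}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop the comment part
        if not data:
            continue  # comment-only or blank line
        fields = [field.strip() for field in data.split(";")]
        code_range, prop = fields[0], fields[1]
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        members.setdefault(prop, set()).update(range(start, end + 1))
    return members


# Abbreviated sample in the style of DerivedCoreProperties.txt (assumed layout).
SAMPLE = """\
# Derived Property: ID_Start
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR
""".splitlines()

id_start = parse_property_lines(SAMPLE)["ID_Start"]
print(0x41 in id_start, 0x30 in id_start)  # True False
```

<p>The derivation comments are skipped on purpose: the sketch treats only the listed ranges
as data, which matches the statement that the listing is the normative definition.</p>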
  293. <h4>2.1.3 <a name="Props_External" href="#Props_External">Properties Dependent on External Specifications</a></h4>
  294. <p>In limited cases, a Unicode character property defined in the Unicode Character Database
  295. may have an external dependency on another specification which is not a part of the Unicode Standard,
  296. and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by
  297. requiring that dependency to be based on a known, published version of the external specification.</p>
  298. <p>As of Version 10.0 of the UCD, the clear example of such an external dependency is the
  299. derivation of some segmentation-related character properties, in part based on emoji properties associated with
  300. UTS #51, "Unicode Emoji" [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>]. The details of the
  301. derivation are described in the respective annexes, [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>]
  302. and [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>], as well as in the documentation portions of
  303. the associated UCD property files. See [<a href="../tr41/tr41-21.html#Data14">Data14</a>]
  304. and [<a href="../tr41/tr41-21.html#Props0">Props</a>].
  305. The version of UTS #51 used for those segmentation properties in Version 10.0 of the UCD is clearly
  306. identified in those annexes and data files.</p>
  307. <p>An external dependency may impact either a simple or a derived property. For example,
  308. the Line_Break property is considered a simple, enumerated property. However, two of the enumerated
  309. values, lb=Emoji_Base and lb=Emoji_Modifier, are synchronized with the associated emoji properties in
  310. emoji-data.txt. In the case of the derived segmentation properties associated with UAX #29,
  311. Grapheme_Cluster_Break, Word_Break, and Sentence_Break, the dependencies are considerably more complex.
  312. See [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>] for full details.</p>
  313. <h3>2.2 <a name="Use_Default" href="#Use_Default">Use of Default Values</a></h3>
  314. <p>Unicode character properties have default values. Default
  315. values are the value or values that a character property takes
  316. for an unassigned code point, or in some instances, for
  317. designated subranges of code points, whether assigned or
  318. unassigned. For example, the default value of a binary
  319. Unicode character property is always "N".</p>
  320. <p>For the formal discussion of default values, see D26 in
  321. <i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  322. For conventions related to default values in various data files
  323. of the UCD and for documentation regarding the particular default values of
  324. individual Unicode character properties, see <a href="#Default_Values">Default Values</a>.</p>
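<p>For a binary property, the default can be applied at lookup time. This is a minimal
sketch, assuming the property has already been loaded as a set of listed code points; the
name <code>binary_property_value</code> and the two-member set are hypothetical.</p>

```python
def binary_property_value(listed_code_points, code_point):
    """Return 'Y' for a listed code point, else the binary default 'N'."""
    return "Y" if code_point in listed_code_points else "N"


# Hypothetical, deliberately incomplete listing for some binary property.
listed = {0x0009, 0x0020}

print(binary_property_value(listed, 0x0020))  # Y
print(binary_property_value(listed, 0x0378))  # N (not in the listing)
```

<p>Properties of other types have their own defaults, sometimes varying by code point range,
so a single fallback of this kind is only appropriate for binary properties.</p>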
  325. <h3>2.3 <a name="Release_Stability" href="#Release_Stability">Stability of Releases</a></h3>
  326. <p>Just as for the Unicode Standard as a whole, each version of the
  327. UCD, once published, is absolutely stable and will <i>never</i>
  328. change. Each released version is archived in a directory on
  329. the Unicode website, with a directory number associated with
  330. that version. URLs pointing to that version's directory are also
  331. stable and will be maintained in perpetuity.</p>
  332. <p>Any errors discovered for a released version of the UCD
  333. are noted in [<a href="../tr41/tr41-21.html#Errata">Errata</a>],
  334. and if appropriate will be corrected in a <i>subsequent</i>
  335. version of the UCD.</p>
  336. <p>Stability guarantees constraining how Unicode character
  337. properties can (or cannot) change between releases of the UCD
  338. are documented in the Unicode Consortium Stability
  339. Policies [<a href="../tr41/tr41-21.html#Stability">Stability</a>].</p>
  340. <h4>2.3.1 <a name="Allowed_Changes" href="#Allowed_Changes">Changes to Properties Between Releases</a></h4>
  341. <p>Updates to character properties in the Unicode Character Database may be required
  342. for any of three reasons:</p>
  343. <ol>
  344. <li>To cover new characters added to the standard</li>
  345. <li>To add new character properties to the standard</li>
  346. <li>To change the assigned values for a property for some characters already in the standard</li>
  347. </ol>
  348. <p>While the Unicode Consortium endeavors to keep the values of all
  349. character properties as stable as possible between versions, occasionally circumstances
  350. may arise which require changing them. In particular, as less well-documented scripts, such
  351. as those for minority languages, or historic scripts are added to the standard, the exact
  352. character properties and behavior may not fully be known when the script is first encoded.
  353. The properties for some of these characters may change as further information becomes
  354. available or as implementations turn up problems in the initial property assignments.
  355. As far as possible, any readjustment of property values based
  356. on growing implementation experience is made to be compatible with established practice.</p>
  357. <p>All changes to normative or informative property values, to the status
  358. or type of a property, or to property or property value aliases, must be approved by
  359. an explicit decision taken by the Unicode Technical Committee. Changes to provisional
  360. property values are subject to less stringent oversight.</p>
  361. <p>Occasionally, a character property value is changed to prevent incorrect generalizations
  362. about a character's use based on its nominal property values. For example, U+200B ZERO
  363. WIDTH SPACE was originally classified as a space character (General_Category=Zs), but
  364. it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters
  365. in its function as a format control for line breaking.</p>
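<p>The current classification can be checked with Python's standard
<code>unicodedata</code> module. This is only an illustration: the module reports values
from the UCD version bundled with the interpreter, which is not necessarily the version
described in this annex.</p>

```python
import unicodedata

# Version of the UCD that this Python build carries.
print(unicodedata.unidata_version)

space_category = unicodedata.category("\u0020")  # U+0020 SPACE
zwsp_category = unicodedata.category("\u200b")   # U+200B ZERO WIDTH SPACE

print(space_category)  # Zs
print(zwsp_category)   # Cf
```

<p>Code that selects space characters by testing for General_Category=Zs therefore does not
pick up U+200B, which is the distinction the reclassification was meant to make clear.</p>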
  366. <p>There is no guarantee that a particular value for an enumerated
  367. property will actually have characters associated with it. Also, because of
  368. changes in property value assignments between versions of the standard, a
  369. property value that once had characters associated with it may later have none.
  370. Such conditions and changes are rare, but implementations must not
  371. assume that all property values are associated with non-empty
  372. sets of characters. For example, currently the special Script property
  373. value Katakana_Or_Hiragana has no characters associated with it.</p>
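<p>A defensive way to handle this is to build the index from the full list of property
values, so that a value with no characters still has an (empty) entry. The sketch below uses
a hypothetical helper, <code>group_by_value</code>, and a two-character excerpt of Script
assignments chosen for illustration.</p>

```python
def group_by_value(all_values, assignments):
    """Map every known property value to its code points, possibly none."""
    groups = {value: set() for value in all_values}
    for code_point, value in assignments.items():
        groups[value].add(code_point)
    return groups


script_values = ["Hiragana", "Katakana", "Katakana_Or_Hiragana"]
# Tiny illustrative excerpt: U+3042 HIRAGANA LETTER A, U+30A2 KATAKANA LETTER A.
assignments = {0x3042: "Hiragana", 0x30A2: "Katakana"}

groups = group_by_value(script_values, assignments)
print(groups["Katakana_Or_Hiragana"])  # set()

# Pass a default wherever an empty set would otherwise raise an error.
first = min(groups["Katakana_Or_Hiragana"], default=None)
print(first)  # None
```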
  374. <h4>2.3.2 <a name="Obsolete_Properties" href="#Obsolete_Properties">Obsolete Properties</a></h4>
  375. <p>In some instances an entire property may become <i>obsolete</i>.
  376. For example, the <a href="#ISO_Comment">ISO_Comment</a> property was once used to keep
  377. track of annotations for characters used in the production of name lists for
  378. ISO/IEC 10646 code charts. As of Unicode 5.2.0 that property became obsolete,
  379. and its value is now defaulted to the null string for all Unicode code points.</p>
  380. <p>An obsolete property is never removed from the UCD.</p>
  381. <h4>2.3.3 <a name="Deprecated_Properties" href="#Deprecated_Properties">Deprecated Properties</a></h4>
  382. <p>Occasionally an obsolete property may also be formally
  383. <i>deprecated</i>. This is an indication that the property is no longer recommended for
  384. use, perhaps because its original intent has been replaced by another property
  385. or because its specification was somehow defective. See also the
  386. general discussion of <a href="#Deprecation">Deprecation</a>.</p>
  387. <p>A deprecated property is never removed from the UCD.</p>
  388. <p><i>Table 1</i> lists the properties that are formally deprecated as of
  389. this version of the Unicode Standard.</p>
  390. <p class="caption">Table 1. <a name="Deprecated_Property_Table" href="#Deprecated_Property_Table">Deprecated Properties</a></p>
  391. <div align="center">
  392. <table class="simple">
  393. <tr>
  394. <th>Property Name</th>
  395. <th>Deprecation Version</th>
  396. <th>Reason</th>
  397. </tr>
  398. <tr>
  399. <td><a href="#Grapheme_Link">Grapheme_Link</a></td>
  400. <td>5.0.0</td>
  401. <td>Duplication of ccc=9</td>
  402. </tr>
  403. <tr>
  404. <td><a href="#Hyphen">Hyphen</a></td>
  405. <td>6.0.0</td>
  406. <td>Supplanted by Line_Break property values</td>
  407. </tr>
  408. <tr>
  409. <td><a href="#ISO_Comment">ISO_Comment</a></td>
  410. <td>6.0.0</td>
  411. <td>No longer needed for chart generation; otherwise not useful</td>
  412. </tr>
  413. <tr>
  414. <td><a href="#Expands_On_NFC">Expands_On_NFC</a></td>
  415. <td>6.0.0</td>
  416. <td>Less useful than UTF-specific calculations</td>
  417. </tr>
  418. <tr>
  419. <td><a href="#Expands_On_NFD">Expands_On_NFD</a></td>
  420. <td>6.0.0</td>
  421. <td>Less useful than UTF-specific calculations</td>
  422. </tr>
  423. <tr>
  424. <td><a href="#Expands_On_NFKC">Expands_On_NFKC</a></td>
  425. <td>6.0.0</td>
  426. <td>Less useful than UTF-specific calculations</td>
  427. </tr>
  428. <tr>
  429. <td><a href="#Expands_On_NFKD">Expands_On_NFKD</a></td>
  430. <td>6.0.0</td>
  431. <td>Less useful than UTF-specific calculations</td>
  432. </tr>
  433. <tr>
  434. <td><a href="#FC_NFKC_Closure">FC_NFKC_Closure</a></td>
  435. <td>6.0.0</td>
  436. <td>Supplanted in usage by <a href="#NFKC_Casefold">NFKC_Casefold</a>; otherwise not useful</td>
  437. </tr>
  438. </table>
  439. </div>
  440. <p>&nbsp;</p>
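<p>An implementation that exposes properties by name might use Table 1 to warn callers. The
following is a sketch only: the dictionary transcribes the table above, while
<code>deprecation_note</code> is a hypothetical helper that compares names exactly and does
not apply the loose matching described under Matching Rules.</p>

```python
# Transcribed from Table 1: property name -> deprecation version.
DEPRECATED_SINCE = {
    "Grapheme_Link": "5.0.0",
    "Hyphen": "6.0.0",
    "ISO_Comment": "6.0.0",
    "Expands_On_NFC": "6.0.0",
    "Expands_On_NFD": "6.0.0",
    "Expands_On_NFKC": "6.0.0",
    "Expands_On_NFKD": "6.0.0",
    "FC_NFKC_Closure": "6.0.0",
}


def deprecation_note(property_name):
    """Return a warning string for a deprecated property, else None."""
    version = DEPRECATED_SINCE.get(property_name)
    if version is None:
        return None
    return f"{property_name} has been deprecated since Unicode {version}"


print(deprecation_note("Hyphen"))
print(deprecation_note("Line_Break"))  # None
```

<p>Because a deprecated property is never removed from the UCD, a lookup like this should
warn rather than reject the request.</p>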
  441. <h4>2.3.4 <a name="Stabilized_Properties" href="#Stabilized_Properties">Stabilized Properties</a></h4>
  442. <p>Another possibility is that an obsolete property may be
  443. declared to be <i>stabilized</i>. Such a determination does not indicate that
  444. the property should or should not be used; instead it is a declaration that the
  445. UTC (Unicode Technical Committee) will no longer actively maintain the property or extend it for newly
  446. encoded characters. The property values of a
  447. stabilized property are frozen as of a particular release of the standard.</p>
  448. <p>A stabilized property is never removed from the UCD.</p>
  449. <p><i>Table 2</i> lists the properties that are formally stabilized as of
  450. this version of the Unicode Standard.</p>
  451. <p class="caption">Table 2. <a name="Stabilized_Property_Table" href="#Stabilized_Property_Table">Stabilized Properties</a></p>
  452. <div align="center">
  453. <table class="simple">
  454. <tr>
  455. <th>Property Name</th>
  456. <th>Stabilization Version</th>
  457. </tr>
  458. <tr>
  459. <td><a href="#Hyphen">Hyphen</a></td>
  460. <td>4.0.0</td>
  461. </tr>
  462. <tr>
  463. <td><a href="#ISO_Comment">ISO_Comment</a></td>
  464. <td>6.0.0</td>
  465. </tr>
  466. </table>
  467. </div>
  468. <p>&nbsp;</p>
  469. <h2>3 <a name="Documentation_Files" href="#Documentation_Files">Documentation</a></h2>
  470. <p>This annex provides the core documentation for the UCD, but
  471. additional information about character properties is available in
  472. other parts of the standard and in additional documentation files
  473. contained within the UCD.</p>
  474. <h3>3.1 <a name="Character_Properties" href="#Character_Properties">Character Properties in the Standard</a></h3>
  475. <p>The formal definitions related to character properties used
  476. by the Unicode Standard are documented in
  477. <i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  478. Understanding those definitions and related terminology is
  479. essential to the appropriate use of Unicode character properties.</p>
  480. <p>See <i>Section 4.1, Unicode Character Database</i>, in
  481. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] for a general
  482. discussion of the UCD and its use in defining properties. The
  483. rest of Chapter 4 provides important explanations regarding
  484. the meaning and use of various normative character properties.</p>
  485. <h3>3.2 <a name="Property_Model" href="#Property_Model">The Character Property Model</a></h3>
  486. <p>For a general discussion of the property model which underlies
  487. the definitions associated with the UCD, see
  488. Unicode Technical Report #23, "The Unicode Character Property Model" [<a href="../tr41/tr41-21.html#UTR23">UTR23</a>].
  489. That technical report is informative, but over the years various
  490. content from it has been incorporated into normative portions
  491. of the Unicode Standard, particularly for the definitions in
  492. Chapter 3.</p>
  493. <p>UTR #23 also discusses string functions and their relation to
  494. character properties.</p>
  495. <h3>3.3 <a name="NamesList" href="#NamesList">NamesList.html</a></h3>
  496. <p>NamesList.html formally describes the format of the NamesList.txt data file in BNF.
  497. That data file is used to drive the printing
  498. of the Unicode code charts and names list. See also <i>Section 24.1,
  499. Character Names List</i>, in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  500. for a detailed discussion of the conventions used in the Unicode names list as
  501. formatted for printing.</p>
  502. <h3>3.4 <a name="StandardizedVariants" href="#StandardizedVariants">StandardizedVariants.html</a></h3>
  503. <p>StandardizedVariants.html has been obsoleted
  504. as of Version 9.0 of the UCD. This file formerly
  505. documented standardized variants, showing a
  506. representative glyph for each. It was closely tied to the data file,
  507. StandardizedVariants.txt, which defines those sequences normatively.</p>
  508. <p>The function of StandardizedVariants.html to show representative
  509. glyphs for standardized variants has been superseded. There are now better means
  510. of illustrating the glyphs. Many standardized variation sequences are shown
  511. in the Unicode code charts directly, in summary sections at the ends of the
  512. names list for any block which contains them. Glyphs for standardized variants
  513. of CJK compatibility ideographs are also shown directly in the Unicode
  514. code charts. Because of the specialized font display requirements for
  515. emoji, often involving color, the standardized emoji variation sequences are not shown in the
  516. Unicode code charts, but have their own dedicated display page instead.</p>
  517. <h3>3.5 <a name="EmojiVariants" href="#EmojiVariants">Emoji Variation Sequences</a></h3>
  518. <p>Starting with Version 9.0.0, the following page in the Unicode emoji
  519. subsite area shows appropriate representative glyphs for all emoji variation sequences:</p>
  520. <p><a href="http://www.unicode.org/emoji/charts/emoji-variants.html">http://www.unicode.org/emoji/charts/emoji-variants.html</a></p>
  521. <p>Emoji variation sequences are a subset of standardized variation sequences,
  522. consisting of an emoji base followed by either the variation selector U+FE0E or the
  523. variation selector U+FE0F. Such sequences come in pairs: the sequence using U+FE0E
  524. is shown with a black and white text presentation, as seen in the Unicode code charts,
  525. and the sequence using U+FE0F is shown with a colorful icon, as usually seen
  526. in emoji implementations on mobile devices and elsewhere.</p>
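<p>The pairing can be illustrated by building both sequences for one base character. This is
a sketch under assumptions: the helper name <code>variation_pair</code> is hypothetical, and
U+2764 HEAVY BLACK HEART is used only as an example base. Whether a given base actually has
standardized sequences must be checked against the data files.</p>

```python
TEXT_SELECTOR = "\uFE0E"   # variation selector requesting text presentation
EMOJI_SELECTOR = "\uFE0F"  # variation selector requesting emoji presentation


def variation_pair(base):
    """Return the (text, emoji) variation sequences for a base character."""
    return base + TEXT_SELECTOR, base + EMOJI_SELECTOR


text_form, emoji_form = variation_pair("\u2764")  # example base: U+2764

print([f"U+{ord(c):04X}" for c in text_form])   # ['U+2764', 'U+FE0E']
print([f"U+{ord(c):04X}" for c in emoji_form])  # ['U+2764', 'U+FE0F']
```

<p>Both results are two code points long; how each one is displayed depends on the fonts and
rendering system in use.</p>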
  527. <h3>3.6 <a name="Unihan" href="#Unihan">Unihan and UAX #38</a></h3>
  528. <p>Unicode Standard Annex #38, "Unicode Han Database (Unihan)"
  529. [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>] describes
  530. the format and content of the Unihan Database, which collects together all property information
  531. for CJK Unified Ideographs. That annex also specifies in detail
  532. which of the Unihan character properties are normative,
  533. informative, or provisional.</p>
  534. <p>The Unihan Database contains extensive and detailed mapping
  535. information for CJK Unified Ideographs encoded in the Unicode Standard,
  536. but it is aimed <i>only</i> at those ideographs, not at other characters used in the East
  537. Asian context in general.
  538. In contrast, East Asian legacy character sets, including important
  539. commercial and national character set standards, contain many non-CJK
  540. characters. As a result, the Unihan Database must be supplemented from
  541. other sources to establish mapping tables for those character sets.</p>
  542. <p>The majority of the content of the Unihan Database is
  543. released for each version of the Unicode Standard as a collection of Unihan data
  544. files in the UCD. Because of their large size, these data files are released only as
  545. a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip
  546. and the CJK properties each one contains are provided in [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].
  547. For versions of the UCD prior to Version 5.2.0, all of the CJK properties were
  548. listed together in a very large, single file, Unihan.txt.</p>
  549. <h3>3.7 <a name="USource" href="#USource">UTC-Source Ideographs and UAX #45</a></h3>
  550. <p>Unicode Standard Annex #45, "U-Source Ideographs"
  551. [<a href="../tr41/tr41-21.html#UAX45">UAX45</a>] describes the format of USourceData.txt,
  552. which lists all of the information for UTC-Source ideographs.</p>
  553. <h3>3.8 <a name="Data_File_Comments" href="#Data_File_Comments">Data File Comments</a></h3>
  554. <p>In addition to the specific documentation files for the UCD, individual data
  555. files often contain extensive header comments describing their content and any
  556. special conventions used in the data.</p>
  557. <p>In some instances, individual property
  558. definition sections also contain comments with information about how the property
  559. may be derived. Such comments are informative; while they are intended
  560. to convey the intent of the derivation, in case of any mismatch between
  561. a statement of a derivation in a comment field and the actual
  562. listing of the derived property, the list is considered to be definitive.
  563. See <a href="#Simple_Derived">Simple and Derived Properties</a>.</p>
  564. <h3>3.9 <a name="Obsolete" href="#Obsolete">Obsolete Documentation Files</a></h3>
  565. <p>UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its
  566. content has been wholly incorporated into this document.</p>
  567. <p>Unihan.html was formerly the primary documentation file for
  568. the Unihan Database. As of Version 5.1.0, its
  569. content has been wholly incorporated into [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].</p>
  570. <p>Versions of the Unicode Standard
  571. prior to Version 4.0.0 contained small, focused
  572. documentation files, UnicodeCharacterDatabase.html, PropList.html, and
  573. DerivedProperties.html, which were later consolidated into UCD.html.</p>
  574. <p>StandardizedVariants.html has been obsoleted as of Version 9.0.0.
  575. See <i>Section 3.4, <a href="#StandardizedVariants">StandardizedVariants.html</a></i>.</p>
  576. <h2>4 <a name="UCD_Files" href="#UCD_Files">UCD Files</a></h2>
  577. <p>The heart of the UCD consists of the data files themselves. This section
  578. describes the directory structure for the UCD and the format conventions
  579. for the data files, and provides documentation for data files not documented
  580. elsewhere in this annex.</p>
  581. <h3>4.1 <a name="Directory_Structure" href="#Directory_Structure">Directory Structure</a></h3>
  582. <p>Each version of the UCD is released in a separate, numbered directory
  583. under the <i>Public</i> directory on the Unicode website. The content of that
  584. directory is complete for that release. It is also stable&#x2014;once released,
  585. it will be archived permanently in that directory, unchanged, at a stable URL.</p>
  586. <p>The specific files for the UCD associated with this version of
  587. the Unicode Standard (10.0.0) are located at:</p>
  588. <blockquote>
  589. <a href="http://www.unicode.org/Public/10.0.0/">http://www.unicode.org/Public/10.0.0/</a>
  590. </blockquote>
  591. <p>The latest released version of the UCD is always accessible via the
  592. following stable URL:</p>
  593. <blockquote>
  594. <a href="http://www.unicode.org/Public/UCD/latest/">http://www.unicode.org/Public/UCD/latest/</a>
  595. </blockquote>
  596. <p>Zipped copies of the latest released version of the UCD are always accessible via the
  597. following stable URL:</p>
  598. <blockquote>
  599. <a href="http://www.unicode.org/Public/zipped/latest/">http://www.unicode.org/Public/zipped/latest/</a>
  600. </blockquote>
  601. <p>Prior to Version 6.3.0, access to the latest released version
  602. of the UCD was via the following stable URL:</p>
  603. <blockquote>
  604. <a href="http://www.unicode.org/Public/UNIDATA/">http://www.unicode.org/Public/UNIDATA/</a>
  605. </blockquote>
  606. <p>That "UNIDATA" URL will be maintained, but is no longer recommended, because
  607. it points to the <i>ucd</i> subdirectory of the latest release, rather than to the parent
  608. directory for the release. The "UNIDATA" naming convention is also very old, and does not follow
  609. the directory naming conventions currently used for other data releases in the
  610. <i>Public</i> directory on the Unicode website.</p>
  611. <h4>4.1.1 <a name="UCD_Proper" href="#UCD_Proper">UCD Files Proper</a></h4>
  612. <p>The UCD proper is located in the <i>ucd</i> subdirectory of the numbered version
  613. directory. That directory contains all of the documentation files and most
  614. of the data files for the UCD, including some data files for derived properties.</p>
  615. <p>Although all UCD data files are version-specific for a release and most contain
  616. internal date and version stamps, the file names of the released data files do not
differ from version to version. When linking to a version-specific data file, the
version is indicated by the version number of the release directory.</p>
  619. <p>All files for derived extracted properties are in the <i>extracted</i>
  620. subdirectory of the <i>ucd</i> subdirectory.
  621. See <a href="#Derived_Extracted">Derived Extracted Properties</a> for
  622. documentation regarding those data files and their content.</p>
  623. <p>A number of auxiliary properties are specified in files in the <i>auxiliary</i>
  624. subdirectory of the <i>ucd</i> subdirectory. It contains
  625. data files specifying properties associated with
  626. Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>]
  627. and with
  628. Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>],
  629. as well as test data for those algorithms.
  630. See <a href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a>
  631. for more information about the test data.</p>
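<p>As a minimal sketch of how this directory layout maps to URLs, the following Python helper
builds the address of a data file in the <i>ucd</i> subdirectory of a numbered release. The function
name and its parameters are illustrative assumptions, not part of the UCD, and the layout applies
only to releases that follow the structure described in this section.</p>

```python
BASE_URL = "http://www.unicode.org/Public"  # base of the URLs quoted in this section


def ucd_file_url(version, filename, subdir=""):
    """Build the URL of a UCD data file for one numbered release.

    version  -- release directory name, for example "10.0.0"
    filename -- unversioned data file name, for example "PropList.txt"
    subdir   -- optional subdirectory of ucd, such as "extracted" or "auxiliary"
    """
    parts = [BASE_URL, version, "ucd"]
    if subdir:
        parts.append(subdir)
    parts.append(filename)
    return "/".join(parts)
```

<p>Because file names do not change between releases, only the <code>version</code> argument
distinguishes one release of a file from another.</p>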
  632. <h4>4.1.2 <a name="UCD_XML_Files" href="#UCD_XML_Files">UCD XML Files</a></h4>
  633. <p>The XML version of the UCD is located in the <i>ucdxml</i> subdirectory of the
numbered version directory. See <a href="#UCD_in_XML">UCD in XML</a> for
more details.</p>
  636. <h4>4.1.3 <a name="Chart_Files" href="#Chart_Files">Charts</a></h4>
  637. <p>The code charts specific to a version of Unicode are archived
as a single large PDF file in the <i>charts</i> subdirectory of the
  639. numbered version directory. See the readme.txt in that subdirectory
  640. and the general web page explaining the
  641. <a href="http://www.unicode.org/charts/About.html">Unicode Code Charts</a> for
  642. more details.</p>
  643. <h4>4.1.4 <a name="Beta_Review" href="#Beta_Review">Beta Review Considerations</a></h4>
  644. <p>Prior to the formal release for any particular version of the UCD, a beta
  645. review is conducted. The beta review files are located in the same directory
  646. that is later used for the released UCD, but during the beta review period,
  647. the subdirectory structure differs somewhat and may contain temporary files,
  648. including documentation of diffs between deltas for the beta review. Also,
  649. during the beta review, all data file names are suffixed with version
  650. numbers and delta numbers. So a typical file name during beta review
  651. may be "PropList-5.2.0d13.txt" instead of the finally released "PropList.txt".</p>
  652. <p>Notices contained in a ReadMe.txt file in the UCD directory during the
  653. beta review period also make it clear that that directory contains
  654. preliminary material under review, rather than a final, stable release.</p>
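<p>A tool that processes beta review files may need to map a suffixed file name back to its
released name. The sketch below assumes that every beta suffix has the shape seen in the single
example above (a hyphen, a three-part version number, the letter "d", and a delta number); that
pattern and the helper name are assumptions generalized from that one example.</p>

```python
import re

# Assumed suffix shape, for example "-5.2.0d13", directly before the file extension.
_BETA_SUFFIX = re.compile(r"-\d+\.\d+\.\d+d\d+(?=\.[A-Za-z0-9]+$)")


def released_name(beta_name):
    """Strip a beta-review version and delta suffix from a file name."""
    return _BETA_SUFFIX.sub("", beta_name)
```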
  655. <h4>4.1.5 <a name="Directory_History" href="#Directory_History">File Directory Differences for Early Releases</a></h4>
  656. <p>The <a href="#UCD_in_XML">UCD in XML</a> was introduced in Version 5.1.0,
  657. so UCD directories prior to that do not contain the <i>ucdxml</i> subdirectory.</p>
  658. <p>UCD directories prior to Version 4.1.0 do not contain the <i>auxiliary</i>
  659. subdirectory.</p>
  660. <p>UCD directories prior to Version 3.2.0 do not contain the <i>extracted</i>
  661. subdirectory.</p>
  662. <p>The general structure of the file directory for a released version of the UCD
  663. described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0,
  664. versions of the UCD were not self-contained, complete sets of data files
  665. for that version, but instead only contained any new data files or any data files
  666. which had <i>changed</i> since the prior release.</p>
  667. <p>Because of this, the property files for a given version
  668. prior to Version 4.1.0 can be spread over several directories. Consult the
  669. component listings at
  670. <a href="http://www.unicode.org/versions/enumeratedversions.html">Enumerated Versions</a>
  671. to find out which files in which directories comprise a complete set of data
  672. files for that version.</p>
  673. <p>The directory naming conventions and the file naming conventions also
  674. differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD
  675. is contained in a directory named <i>4.0-Update</i>, and Version 4.0.1 of
  676. the UCD in a directory named <i>4.0-Update1</i>. Furthermore, for these
  677. earlier versions, the data file names <i>do</i> contain explicit version
  678. numbers.</p>
  679. <h3>4.2 <a name="Format_Conventions" href="#Format_Conventions">File Format Conventions</a></h3>
  680. <p>Files in the UCD use the format conventions described in
  681. this section, unless otherwise specified.</p>
  682. <h4>4.2.1 <a name="Data_Fields" href="#Data_Fields">Data Fields</a></h4>
  683. <ul>
  684. <li>Each line of data consists of fields separated by semicolons. The fields are numbered
  685. starting with zero.</li>
  686. <li>The first field (0) of each line in the Unicode Character Database files represents a code
  687. point or range. The remaining fields (1..n) are properties associated with that code point.</li>
  688. <li>Leading and trailing spaces within a field are not significant.
  689. However, no leading or trailing spaces
  690. are allowed in any field of UnicodeData.txt. For legacy reasons,
  691. no spaces are allowed before or after the semicolon in LineBreak.txt and in EastAsianWidth.txt.</li>
  692. <li>The Unihan data files in the UCD have a separate format, using tab characters
  693. instead of semicolons to separate fields. See [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>]
  694. for the detailed specification of the format of the Unihan data files. The
  695. data files TangutSources.txt and NushuSources.txt also use this format.</li>
  696. </ul>
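<p>The field conventions above can be sketched as a small splitting function. The name
<code>split_fields</code> and its <code>sep</code> parameter are hypothetical; the function assumes
that any trailing comment has already been removed from the line.</p>

```python
def split_fields(line, sep=";"):
    """Split one data line into fields, numbered from zero.

    Leading and trailing spaces in each field are dropped, because they
    are not significant. Pass sep="\t" for the tab-separated files.
    """
    return [field.strip() for field in line.rstrip("\r\n").split(sep)]
```

<p>Stripping is harmless for UnicodeData.txt, LineBreak.txt, and EastAsianWidth.txt, which
contain no such spaces in the first place.</p>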
  697. <h4>4.2.2 <a name="Code_Points" href="#Code_Points">Code Points and Sequences</a></h4>
  698. <ul>
  699. <li>Code points are expressed as hexadecimal numbers with four to six digits.
  700. They are written without the &quot;U+&quot; prefix in
  701. all data files except the Unihan data files. The Unihan data files use the &quot;U+&quot; prefix for
  702. all Unicode code points, to distinguish them from other decimal and hexadecimal
  703. numerical references occurring in their data fields.</li>
  704. <li>When a data field contains a sequence of code points, spaces separate
  705. the code points.
  706. </li>
  707. </ul>
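<p>A minimal sketch of reading code points and code point sequences under these conventions
follows. The function names are illustrative, and the optional "U+" prefix is accepted so that
the same helper also covers the Unihan style.</p>

```python
import string


def parse_code_point(text):
    """Parse a four- to six-digit hexadecimal code point, with or without "U+"."""
    text = text.strip()
    if text.startswith("U+"):
        text = text[2:]
    if not (4 <= len(text) <= 6 and all(c in string.hexdigits for c in text)):
        raise ValueError(f"not a UCD code point: {text!r}")
    value = int(text, 16)
    if value > 0x10FFFF:
        raise ValueError(f"code point out of range: {text!r}")
    return value


def parse_sequence(field):
    """Parse a space-separated sequence of code points; an empty field gives []."""
    return [parse_code_point(token) for token in field.split()]
```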
  708. <h4>4.2.3 <a name="Code_Point_Ranges" href="#Code_Point_Ranges">Code Point Ranges</a></h4>
  709. <ul>
  710. <li>A range of code points is specified by the form &quot;X..Y&quot;.</li>
<li>Each code point in a range has the
property value specified on that data line. For example (from Blocks.txt):
  713. <blockquote>
  714. <pre>
  715. 0000..007F; Basic Latin
  716. 0080..00FF; Latin-1 Supplement
  717. </pre>
  718. </blockquote>
  719. </li>
  720. <li>For backward compatibility, ranges in the file UnicodeData.txt
  721. are specified by entries for the
  722. start and end characters of the range, rather than by the form &quot;X..Y&quot;.
  723. The start character is indicated by a range identifier, followed by a comma
  724. and the string &quot;First&quot;, in angle brackets. This entry takes the
  725. place of a regular character name in field 1 for that line.
  726. The end character is indicated on the next line with the same range identifier,
  727. followed by a comma and the string &quot;Last&quot;, in angle brackets:
  728. <blockquote>
  729. <pre>
  730. 4E00;&lt;CJK Ideograph, First&gt;;Lo;0;L;;;;;N;;;;;
  731. 9FD5;&lt;CJK Ideograph, Last&gt;;Lo;0;L;;;;;N;;;;;
  732. </pre>
  733. </blockquote>
  734. For character ranges using this convention, the names of all characters in the range
  735. are algorithmically derivable.
  736. See <i>Section 4.8, Name</i>
  737. in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] for more information on
  738. derivation of character names for such ranges.</li>
  739. </ul>
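<p>Both range conventions can be handled with a few lines of Python. This is a sketch under the
assumptions stated in the comments; the function names are hypothetical, and the second function
expects lines of UnicodeData.txt in file order.</p>

```python
def parse_range(field):
    """Parse "X..Y" or a single code point into a Python range (end inclusive in the file)."""
    first, sep, last = field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if sep else start
    return range(start, end + 1)


def unicode_data_ranges(lines):
    """Yield (range, fields) pairs, merging each First/Last pair into one range."""
    pending_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            pending_start = cp            # remember the start, wait for the Last entry
        elif name.endswith(", Last>") and pending_start is not None:
            yield range(pending_start, cp + 1), fields
            pending_start = None
        else:
            yield range(cp, cp + 1), fields
```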
  740. <h4>4.2.4 <a name="Comments" href="#Comments">Comments</a></h4>
  741. <ul>
  742. <li>U+0023 NUMBER SIGN (&quot;#&quot;) is used to indicate comments: all
  743. characters from the number sign to the end
  744. of the line are considered part of the comment, and are disregarded when parsing data.</li>
  745. <li>In many files, the comments on data
  746. lines use a common format, as illustrated here (from Scripts.txt):
  747. <blockquote>
  748. <pre>09B2 ; Bengali # Lo BENGALI LETTER LA</pre>
  749. </blockquote>
  750. </li>
  751. <li>The first part of a comment using this common format is the General_Category value,
  752. provided for information. This is followed by the character name for
  753. the code point in the first field (0).</li>
<li>The printing of the General_Category value is suppressed in instances where
it would be redundant, as for DerivedGeneralCategory.txt, in which the
property value in the data field is already the General_Category value.</li>
  757. <li>The symbol &quot;L&amp;&quot;
  758. indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase,
  759. or titlecase letter). For example:
  760. <blockquote>
  761. <pre>0386 ; Greek # L&amp; GREEK CAPITAL LETTER ALPHA WITH TONOS</pre>
  762. </blockquote>
  763. L&amp; as used in these comments is an alias for
  764. the derived LC value (cased letter) for the General_Category property, as documented in
  765. PropertyValueAliases.txt.</li>
  766. <li>When the data line contains a range of code points, this common format
  767. for a comment also indicates a range of character names, separated by &quot;..&quot;, as
  768. illustrated here (from DerivedNumericType.txt):
  769. <blockquote>
  770. <pre>00BC..00BE ; Numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS</pre>
  771. </blockquote>
  772. </li>
  773. <li>Normally, consecutive characters with the same property value would be
  774. represented by a single code point range. In data files using this
  775. comment convention, such ranges are subdivided so that all
  776. characters in a range also
  777. have the same General_Category value (or LC).
  778. While this convention results in more ranges than are strictly necessary, it
  779. makes the contents of the ranges clearer.</li>
  780. <li>When a code point range occurs, the number of items in the range is
  781. included in the comment (in square brackets), immediately following the General_Category value.</li>
  782. <li>The comments are purely informational, and may change format or be omitted in the
  783. future. They should not be parsed for content.</li>
  784. </ul>
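<p>Since comments carry no data, a parser can discard them before looking at fields. The
following sketch shows one way to do that; the helper names are illustrative. Note that it also
discards lines consisting only of a comment, so any specially-formatted comment lines that a
parser wants to read must be picked up before this step.</p>

```python
def strip_comment(line):
    """Return the part of a line before the first "#", without trailing spaces."""
    return line.split("#", 1)[0].rstrip()


def data_lines(lines):
    """Yield only the lines that still contain data after comment removal."""
    for line in lines:
        data = strip_comment(line)
        if data.strip():
            yield data
```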
  785. <h4>4.2.5 <a name="Code_Point_Labels" href="#Code_Point_Labels">Code Point Labels</a></h4>
  786. <ul>
  787. <li>Surrogate code points, private-use characters, control codes, noncharacters,
  788. and unassigned code points have no names. When such code points are
  789. listed in the data files, for example to list their General_Category
  790. values, the comments use code point labels instead of character
  791. names. For example (from DerivedCoreProperties.txt):
  792. <blockquote>
  793. <pre>2065 ; Default_Ignorable_Code_Point # Cn &lt;reserved-2065&gt;</pre>
  794. </blockquote>
  795. </li>
  796. <li>Code point labels use one of the tags as documented in
  797. <i>Section 4.8, Name</i>
  798. in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] and as shown in <i>Table 3</i>,
  799. followed by &quot;-&quot; and the code point expressed in hexadecimal. The
  800. entire label is then enclosed in angle brackets.</li>
  801. </ul>
  802. <p class="caption">Table 3. <a name="Label_Tags_Table" href="#Label_Tags_Table">Code Point Label Tags</a></p>
  803. <div align="center">
  804. <table class="simple">
  805. <tr>
  806. <th>Tag</th>
  807. <th>General_Category</th>
  808. <th>Note</th>
  809. </tr>
  810. <tr>
  811. <td>reserved</td>
  812. <td>Cn</td>
  813. <td>Noncharacter_Code_Point=F</td>
  814. </tr>
  815. <tr>
  816. <td>noncharacter</td>
  817. <td>Cn</td>
  818. <td>Noncharacter_Code_Point=T</td>
  819. </tr>
  820. <tr>
  821. <td>control</td>
  822. <td>Cc</td>
  823. <td>&nbsp;</td>
  824. </tr>
  825. <tr>
  826. <td>private-use</td>
  827. <td>Co</td>
  828. <td>&nbsp;</td>
  829. </tr>
  830. <tr>
  831. <td>surrogate</td>
  832. <td>Cs</td>
  833. <td>&nbsp;</td>
  834. </tr>
  835. </table>
  836. </div>
  837. <p>&nbsp;</p>
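<p>The label construction described above and in <i>Table 3</i> can be sketched as follows. The
function name and the <code>noncharacter</code> flag are assumptions made for illustration; the
flag stands for the Noncharacter_Code_Point value that distinguishes the two Cn tags.</p>

```python
# Tags from Table 3 for General_Category values other than Cn.
_TAG_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def code_point_label(general_category, cp, noncharacter=False):
    """Build a label such as "<reserved-2065>" for a code point without a name."""
    if general_category == "Cn":
        tag = "noncharacter" if noncharacter else "reserved"
    else:
        tag = _TAG_BY_CATEGORY[general_category]  # KeyError for categories that have names
    return f"<{tag}-{cp:04X}>"
```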
  838. <h4>4.2.6 <a name="Multiple_Properties" href="#Multiple_Properties">Multiple Properties in One Data File</a></h4>
  839. <ul>
  840. <li>When a file contains the specification for multiple properties, the second field specifies the name
  841. of the property and the third field specifies the property value. For example (from
  842. DerivedNormalizationProps.txt):
  843. <blockquote>
  844. <pre>
  845. 03D2 ; FC_NFKC; 03C5 # L&amp; GREEK UPSILON WITH HOOK SYMBOL
  846. 03D3 ; FC_NFKC; 03CD # L&amp; GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
  847. </pre>
  848. </blockquote>
  849. </li>
  850. </ul>
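<p>A file that specifies several properties can be read into one table per property name. The
sketch below is illustrative only: the function name is hypothetical, property values are kept
as raw strings, and lines with fewer than three fields are skipped because they follow the
binary property convention instead.</p>

```python
def parse_multi_property(lines):
    """Return {property_name: {code_point: raw_value}} from three-field lines."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()      # drop the comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3:
            continue                              # not a name-plus-value entry
        first, _, last = fields[0].partition("..")
        start, end = int(first, 16), int(last or first, 16)
        values = table.setdefault(fields[1], {})
        for cp in range(start, end + 1):
            values[cp] = fields[2]
    return table
```

<p>Expanding every range into individual code points keeps the sketch short; a production parser
would more likely store the ranges themselves.</p>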
  851. <h4>4.2.7 <a name="Binary_Values" href="#Binary_Values">Binary Property Values</a></h4>
  852. <ul>
<li>For binary properties, the second field specifies the name of the applicable property, with
the implied value of the property being &quot;Y&quot; (= True). Only the code points and ranges
with that value are listed. For example (from PropList.txt):
  856. <blockquote>
  857. <pre>
  858. 1680 ; White_Space # Zs OGHAM SPACE MARK
  859. 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
  860. </pre>
  861. </blockquote>
  862. </li>
  863. </ul>
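<p>Because only the True entries are listed, a lookup treats any unlisted code point as False.
A minimal sketch, with hypothetical function names:</p>

```python
def parse_binary_properties(lines):
    """Return {property_name: [(start, end), ...]} with inclusive ranges."""
    props = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        cp_field, name = [f.strip() for f in data.split(";")][:2]
        first, _, last = cp_field.partition("..")
        props.setdefault(name, []).append((int(first, 16), int(last or first, 16)))
    return props


def has_property(props, name, cp):
    """True if cp is listed for the property; unlisted code points are False."""
    return any(start <= cp <= end for start, end in props.get(name, []))
```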
  864. <h4>4.2.8 <a name="Multiple_Values" href="#Multiple_Values">Multiple Values for Properties</a></h4>
  865. <ul>
  866. <li>When a data file defines a property which may take multiple values for a single code
  867. point, the multiple values are expressed in a space-delimited list. For example (from ScriptExtensions.txt):
  868. <blockquote>
  869. <pre>
  870. 0640 ; Adlm Arab Mand Mani Phlp Syrc # Lm ARABIC TATWEEL
  871. </pre>
  872. </blockquote>
  873. </li>
  874. <li>In some cases&#x2014;but not all&#x2014;the order of multiple elements in a space-delimited
  875. list may be significant. When the order of multiple elements is significant, it is documented
  876. along with the property itself. For example (from Unihan_Readings.txt), for the tag kMandarin,
  877. when there are two values for a code point, the first value is used to
  878. indicate a preferred pronunciation for zh-Hans (CN) and the second a
  879. preferred pronunciation for zh-Hant (TW).
  880. </li>
  881. <li>For further discussion, see Section 5.7.6 <a href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a>.</li>
  882. </ul>
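<p>A space-delimited value list can be split with ordinary whitespace splitting. The sketch below
keeps the values in a list rather than a set, so that file order is preserved for the properties
where order is significant; the function name is an assumption.</p>

```python
def parse_multi_valued_line(line):
    """Return (code_point_field, [values]) for a two-field line with a value list."""
    data = line.split("#", 1)[0]
    cp_field, value_field = (f.strip() for f in data.split(";"))
    return cp_field, value_field.split()
```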
  883. <h4>4.2.9 <a name="Default_Values" href="#Default_Values">Default Values</a></h4>
  884. <ul>
  885. <li>Entries for a code point may be omitted in a data file if the
  886. code point has a default value for the property in question.</li>
  887. <li>For string properties,
  888. including the definition of foldings, the
  889. default value is the code point of the character itself.</li>
  890. <li>For miscellaneous properties which take strings as values,
  891. such as the Unicode Name property, the default value is a null
  892. string.</li>
  893. <li>For binary properties, the default value is always &quot;N&quot; (= False)
  894. and is always omitted.</li>
  895. <li>For enumerated and catalog properties, the default value is listed in a comment. For
  896. example (from Scripts.txt):
  897. <blockquote>
  898. <pre>
  899. # All code points not explicitly listed for Script
  900. # have the value Unknown (Zzzz).
  901. </pre>
  902. </blockquote>
  903. </li>
  904. <li>A few properties of the enumerated type have multiple default values. In
  905. those cases, comments in the file explain the code point ranges for applicable values.
  906. See also <a href="#Default_Values_Table"><i>Table 4</i></a>.</li>
  907. <li>Default values are also listed in specially-formatted comment lines,
  908. using the keyword &quot;@missing&quot;. Parsers which extract and process
  909. these lines can algorithmically determine the default values for all code points.
  910. See <a href="#Missing_Conventions">@missing Conventions</a>
  911. for details about the syntax and use of these lines.
  912. </li>
  913. <li>Because of the legacy format constraints for UnicodeData.txt, that
  914. file contains no specific information about default values for properties.
  915. The default values for fields in UnicodeData.txt are documented
  916. in <a href="#Default_Values_Table"><i>Table 4</i></a> below
  917. if they cannot be derived from the general rules about default values
  918. for properties.</li>
  919. <li>The file ArabicShaping.txt is also exceptional, because it omits the listing
  920. of many characters whose property value (jt=T) can be derived by rule. Adding an &quot;@missing&quot; line
  921. to that file would result in the wrong interpretation of Joining_Type values for omitted characters.
  922. The full explicit listing of Joining_Type values and the correct &quot;@missing&quot; line for
  923. the default Joining_Type value (jt=U) can be found in the file DerivedJoiningType.txt instead.</li>
  924. </ul>
  925. <p>Default values for common catalog, enumeration, and
  926. numeric properties are listed in <i>Table 4</i>.
  927. Further explanation is provided below the table, in
  928. those cases where the default values
  929. are complex, as indicated in the third column.</p>
  930. <p class="caption">Table 4. <a name="Default_Values_Table" href="#Default_Values_Table">Default Values for Properties</a></p>
  931. <div align="center">
  932. <table class="simple">
  933. <tr>
  934. <th>Property Name</th>
  935. <th>Default Value(s)</th>
  936. <th>Complex?</th>
  937. </tr>
  938. <tr>
  939. <td>Age</td>
  940. <td>Unassigned (= NA)</td>
  941. <td>No</td>
  942. </tr>
  943. <tr>
  944. <td>Bidi_Class</td>
  945. <td>L, AL, R, BN, ET</td>
  946. <td>Yes</td>
  947. </tr>
  948. <tr>
  949. <td>Block</td>
  950. <td>No_Block</td>
  951. <td>No</td>
  952. </tr>
  953. <tr>
  954. <td>Canonical_Combining_Class</td>
  955. <td>Not_Reordered (= 0)</td>
  956. <td>No</td>
  957. </tr>
  958. <tr>
  959. <td>Decomposition_Type</td>
  960. <td>None</td>
  961. <td>No</td>
  962. </tr>
  963. <tr>
  964. <td>East_Asian_Width</td>
  965. <td>Neutral (= N), Wide (= W)</td>
  966. <td>Yes</td>
  967. </tr>
  968. <tr>
  969. <td>General_Category</td>
  970. <td>Cn</td>
  971. <td>No</td>
  972. </tr>
  973. <tr>
  974. <td>Line_Break</td>
  975. <td>Unknown (= XX), ID, PR</td>
  976. <td>Yes</td>
  977. </tr>
  978. <tr>
  979. <td>Numeric_Type</td>
  980. <td>None</td>
  981. <td>No</td>
  982. </tr>
  983. <tr>
  984. <td>Numeric_Value</td>
  985. <td>NaN</td>
  986. <td>No</td>
  987. </tr>
  988. <tr>
  989. <td>Script</td>
  990. <td>Unknown (= Zzzz)</td>
  991. <td>No</td>
  992. </tr>
  993. <tr>
  994. <td>Vertical_Orientation</td>
  995. <td>Rotated (= R), Upright (= U)</td>
  996. <td>Yes</td>
  997. </tr>
  998. </table>
  999. </div>
  1000. <p><i>Complex default values</i> are those which take multiple values, contingent on
  1001. code point ranges or other conditions. Complex default values other than those specified in the
  1002. &quot;@missing&quot; line are explicitly listed in the relevant property file, except for instances
  1003. noted in this section. This means that a parser extracting property values from
  1004. the UCD should never encounter an ambiguous condition for which the default value of a property
  1005. for a particular code point is unclear.</p>
  1006. <p>Default values for the
  1007. <a href="#Bidi_Class">Bidi_Class</a> property are complex. See
  1008. Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>]
  1009. and DerivedBidiClass.txt for full details.</p>
  1010. <p>Default values for the <a href="#East_Asian_Width">East_Asian_Width</a>
  1011. property are complex. This property defaults to Neutral for most code points, but defaults to Wide
  1012. for unassigned code points in blocks associated with CJK ideographs.
  1013. See Unicode Standard Annex #11, "East Asian Width"
  1014. [<a href="../tr41/tr41-21.html#UAX11">UAX11</a>] and
  1015. EastAsianWidth.txt for documentation of the default values
  1016. and DerivedEastAsianWidth.txt for the full listing of values.</p>
  1017. <p>Default values for the <a href="#Line_Break">Line_Break</a>
  1018. property are complex. This property defaults to Unknown for most code points, but defaults to ID
  1019. for unassigned code points in blocks associated with CJK ideographs, and
  1020. in blocks in the range U+1F000..U+1FFFD.
  1021. The property defaults to PR for unassigned code
  1022. points in the Currency Symbols block. See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm"
  1023. [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>]
  1024. and LineBreak.txt for documentation of the default values
  1025. and DerivedLineBreak.txt for the full listing of values.</p>
  1026. <p>Default values for the <a href="#Vertical_Orientation">Vertical_Orientation</a>
  1027. property are complex. This property defaults to Rotated (R) for most code points,
  1028. but defaults to Upright (U)
  1029. for unassigned code points in blocks associated with scripts that are themselves predominantly Upright.
  1030. See Unicode Standard Annex #50, "Unicode Vertical Text Layout"
  1031. [<a href="../tr41/tr41-21.html#UAX50">UAX50</a>] and VerticalOrientation.txt for full details.</p>
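<p>The simple (non-complex) default rules in this section can be sketched as a lookup that falls
back to a default when a code point has no entry. The function name, the <code>kind</code> labels,
and the <code>enumerated_default</code> parameter are assumptions for illustration. Complex
defaults that depend on code point ranges are deliberately not handled here.</p>

```python
def value_or_default(table, cp, kind, enumerated_default=None):
    """Look up cp in a {code_point: value} table, applying the simple default rules.

    kind is one of "string", "miscellaneous", "binary", "enumerated", "catalog".
    """
    if cp in table:
        return table[cp]
    if kind == "string":
        return chr(cp)                 # default is the code point itself
    if kind == "miscellaneous":
        return ""                      # default is a null string
    if kind == "binary":
        return False                   # default is always "N"
    if kind in ("enumerated", "catalog") and enumerated_default is not None:
        return enumerated_default      # taken from the file's comment or @missing line
    raise ValueError(f"no default known for kind {kind!r}")
```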
  1032. <h4>4.2.10 <a name="Missing_Conventions" href="#Missing_Conventions">@missing Conventions</a></h4>
  1033. <p>Specially-formatted comment lines with the keyword "@missing" are
  1034. used to define default property values for ranges of code points not explicitly listed
  1035. in a data file. These lines follow regular conventions that make them
  1036. machine-readable.</p>
  1037. <p>An @missing line starts with the comment character "#", followed by
  1038. a space, then the "@missing" keyword, followed by a colon, another space, a code
  1039. point range, and a semicolon. Then the
  1040. line typically continues with a semicolon-delimited list of one or more
  1041. default property values. For example:</p>
  1042. <blockquote>
  1043. <pre>
  1044. # @missing: 0000..10FFFF; Unknown
  1045. </pre>
  1046. </blockquote>
  1047. <p>In general, the code point range and semicolon-delimited list follow
  1048. the same syntactic conventions as the data file in which the @missing line occurs, so
  1049. that any parser which interprets that data file can easily be adapted to also
  1050. parse and interpret an @missing line to pick up default property values for code points.</p>
  1051. <p>@missing lines are also supplied for many properties in the file
  1052. PropertyValueAliases.txt. In this case, because there are many @missing lines in that
  1053. single data file, each @missing line contains an additional second field specifying the
  1054. property name for which it defines a default value.</p>
  1055. <p>An @missing line is never provided for a binary property, because the
  1056. default value for binary properties is always "N" and need not be defined redundantly
  1057. for each binary property.</p>
  1058. <p>Because of the
  1059. addition of property names when @missing lines are included in PropertyValueAliases.txt,
  1060. there are currently two syntactic patterns used for @missing lines, as
  1061. summarized schematically below:</p>
  1062. <ol>
  1063. <li>code_point_range; default_prop_val</li>
  1064. <li>code_point_range; property_name; default_prop_val</li>
  1065. </ol>
  1066. <p>In this schematic representation, "default_prop_val" stands in for
  1067. either an explicit property value or for a special tag such as &lt;none&gt; or
  1068. &lt;script&gt;.</p>
  1069. <p>Pattern #1 is used in most primary and derived UCD files. For example:</p>
  1070. <blockquote>
  1071. <pre>
  1072. # @missing: 0000..10FFFF; &lt;none&gt;
  1073. </pre>
  1074. </blockquote>
  1075. <p>Pattern #2 is used in PropertyValueAliases.txt and in
  1076. DerivedNormalizationProps.txt, both of which contain values associated with many
  1077. properties. For example:</p>
  1078. <blockquote>
  1079. <pre>
  1080. # @missing: 0000..10FFFF; NFD_QC; Yes
  1081. </pre>
  1082. </blockquote>
  1083. <p>The special tag values which may occur in the default_prop_val field
  1084. in an @missing line are interpreted as follows:</p>
  1085. <div align="center">
  1086. <table class="simple">
  1087. <tr>
  1088. <th>Tag</th>
  1089. <th>Interpretation</th>
  1090. </tr>
  1091. <tr>
  1092. <td>&lt;none&gt;</td>
  1093. <td>the empty string</td>
  1094. </tr>
  1095. <tr>
  1096. <td>&lt;code point&gt;</td>
  1097. <td>the string representation of the code point value</td>
  1098. </tr>
  1099. <tr>
  1100. <td>&lt;script&gt;</td>
  1101. <td>the value equal to the Script property value for this code point</td>
  1102. </tr>
  1103. </table>
  1104. </div>
  1105. <p>&nbsp;</p>
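<p>The two @missing patterns and the special tags can be read with a short parser. This is a
sketch, not a reference implementation: the names are hypothetical, the caller states which
pattern the file uses through <code>with_property_name</code>, and the &lt;code point&gt; tag is
interpreted here as the one-character string for that code point. The <code>script_of</code>
callback is an assumed hook that returns the Script value of a code point.</p>

```python
import re
from typing import NamedTuple, Optional, Tuple


class Missing(NamedTuple):
    start: int
    end: int
    property_name: Optional[str]
    values: Tuple[str, ...]


_MISSING = re.compile(r"^# @missing: ([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6});(.*)$")


def parse_missing(line, with_property_name=False):
    """Parse an @missing line; return None for any other line."""
    match = _MISSING.match(line.strip())
    if match is None:
        return None
    fields = [f.strip() for f in match.group(3).split(";")]
    name = fields.pop(0) if with_property_name else None   # pattern #2 carries a name
    if not fields:
        return None
    return Missing(int(match.group(1), 16), int(match.group(2), 16), name, tuple(fields))


def resolve_default(value, cp, script_of=None):
    """Turn a default_prop_val, possibly a special tag, into a value for cp."""
    if value == "<none>":
        return ""
    if value == "<code point>":
        return chr(cp)
    if value == "<script>":
        if script_of is None:
            raise ValueError("<script> needs a script_of callback")
        return script_of(cp)
    return value
```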
  1106. <h4>4.2.11 <a name="Empty_Fields" href="#Empty_Fields">Empty Fields</a></h4>
  1107. <p>The data file UnicodeData.txt defines many property values in each record. When a
  1108. field in a data line for a code point is empty, that indicates that the property takes
  1109. the default value for that code point. For example:</p>
  1110. <blockquote>
  1111. <pre>
  1112. 0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
  1113. </pre>
  1114. </blockquote>
  1115. <p>In that data line, the empty numeric fields indicate that the value of Numeric_Value for
  1116. U+0022 is NaN and that the value of Numeric_Type is None. The empty case mapping fields indicate
  1117. that the value of Simple_Uppercase_Mapping for U+0022 takes the default value, namely the
  1118. code point itself, and so forth.</p>
  1119. <p>The interpretation of empty fields in other data files of the UCD differs. In the
  1120. case of data files which define string properties, the omission of an entry for a code point
  1121. indicates that the property takes the default value for that code point. However, if there
  1122. is an entry for a code point, but the property value field for that entry is empty, that
  1123. indicates that the property value is an explicit empty string (""). For example, the derived string
  1124. property <a href="#NFKC_Casefold">NFKC_Casefold</a> may map a code point to a sequence of code points, to a single different code
  1125. point, to the same single code point, or to no code point at all (an empty string). See the following entries from
  1126. the data file DerivedNormalizationProps.txt:</p>
  1127. <blockquote>
  1128. <pre>
  1129. 00AA ; NFKC_CF; 0061 # Lo FEMININE ORDINAL INDICATOR
  1130. 00AD ; NFKC_CF; # Cf SOFT HYPHEN
  1131. 00AF ; NFKC_CF; 0020 0304 # Sk MACRON
  1132. </pre>
  1133. </blockquote>
  1134. <p>The empty field for U+00AD indicates that the property NFKC_Casefold maps SOFT HYPHEN
  1135. to an empty string. By contrast, the absence of the entry for U+00AE in the data file indicates
  1136. that the property NFKC_Casefold maps U+00AE REGISTERED SIGN to itself&#x2014;the default value.</p>
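<p>The difference between an empty field and a missing entry can be made concrete with a sketch
for NFKC_Casefold. The function names are illustrative. An entry with an empty value field is
stored as an explicit empty string, while a code point with no entry falls back to itself.</p>

```python
def parse_nfkc_cf(lines):
    """Return {code_point: mapped_string} for NFKC_CF entries only."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3 or fields[1] != "NFKC_CF":
            continue
        first, _, last = fields[0].partition("..")
        # An empty value field yields "", the explicit empty string.
        target = "".join(chr(int(cp, 16)) for cp in fields[2].split())
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            mapping[cp] = target
    return mapping


def nfkc_casefold_char(mapping, cp):
    """Apply the mapping to one code point; absent entries map to the code point itself."""
    return mapping.get(cp, chr(cp))
```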
  1137. <h4>4.2.12 <a name="Text_Encoding" href="#Text_Encoding">Text Encoding</a></h4>
  1138. <ul>
<li>The data files use UTF-8. Unless otherwise noted, non-ASCII characters appear
only in comments.</li>
  1141. <li>The Unihan data files in the UCD make extensive use of UTF-8 in data fields.
  1142. (See [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>] for details.)</li>
  1143. <li>For legacy reasons, NamesList.txt was exceptional; it was encoded
  1144. in Latin-1 prior to Unicode 6.2. For
  1145. Unicode 6.2 and later, the encoding is UTF-8. See <a href="#NamesList">NamesList.html</a>.</li>
  1146. <li>Segmentation test data files, such as WordBreakTest.txt, make
  1147. use of non-ASCII (UTF-8) characters as delimiters for data fields.</li>
  1148. </ul>
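<p>A reader that handles archived releases may need to pick the encoding per file. The sketch
below encodes the one exception stated above; the function name and the tuple form of the
version number are assumptions.</p>

```python
def ucd_encoding(filename, version):
    """Return the text encoding for a UCD file.

    version is a tuple of integers such as (6, 1, 0).
    """
    if filename == "NamesList.txt" and version < (6, 2):
        return "latin-1"   # legacy encoding before Unicode 6.2
    return "utf-8"
```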
  1149. <h4>4.2.13 <a name="Line_Termination" href="#Line_Termination">Line Termination</a></h4>
  1150. <ul>
  1151. <li>All data files in the UCD use LF line termination (not CRLF line termination).
  1152. When copied to different systems, these line endings may be automatically changed to
  1153. use the native line termination conventions for that system. Make sure your editor (or parser) can
  1154. deal with the line termination
  1155. style in the local copy of the data files.</li>
  1156. </ul>
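<p>One way to make a parser independent of the local line termination style is to read the
file through a text layer with newline translation enabled. The sketch below uses Python's
<code>io.TextIOWrapper</code> with <code>newline=None</code>; the function name is
hypothetical.</p>

```python
import io


def iter_lines(raw, encoding="utf-8"):
    """Yield lines from raw bytes, accepting LF, CRLF, or CR line endings."""
    # newline=None turns every line ending into "\n" while reading.
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None) as stream:
        for line in stream:
            yield line.rstrip("\n")
```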
  1157. <h4>4.2.14 <a name="Other_Conventions" href="#Other_Conventions">Other Conventions</a></h4>
  1158. <ul>
  1159. <li>In some test data files, segments of the test data are distinguished by a line
  1160. starting with an &quot;@&quot; sign. For example (from NormalizationTest.txt):
  1161. <blockquote>
  1162. <pre>
  1163. @Part1 # Character by character test
  1164. </pre>
  1165. </blockquote>
  1166. </li>
  1167. </ul>
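<p>The segment markers can be used to group test lines. In this sketch the function name is an
assumption, and any data lines that appear before the first marker are collected under the key
<code>None</code>.</p>

```python
def split_parts(lines):
    """Group data lines by the most recent "@" marker line."""
    parts = {}
    current = None
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        if data.startswith("@"):
            current = data[1:]          # for example "Part1"
            parts[current] = []
        else:
            parts.setdefault(current, []).append(data)
    return parts
```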
  1168. <h4>4.2.15 <a name="Other_File_Formats" href="#Other_File_Formats">Other File Formats</a></h4>
  1169. <ul>
  1170. <li>The data format for Unihan data files and for
  1171. TangutSources.txt and NushuSources.txt
  1172. in the UCD differs from the standard format.
  1173. See the discussion of <a href="#Unihan">Unihan and UAX #38</a>
  1174. earlier in this annex for more information.</li>
  1175. <li>The format for NamesList.txt, which documents the Unicode names
  1176. list and which is used programmatically to drive the formatting
  1177. program for Unicode code charts, also differs significantly from regular UCD data files.
See <a href="#NamesList">NamesList.html</a>.</li>
  1179. <li>Index.txt is another exception. It uses a tab-delimited format, with field 0
  1180. consisting of an index entry string, and field 1 a code point. Index.txt is used to
  1181. maintain the <a href="http://www.unicode.org/charts/charindex.html">
  1182. Unicode Character Name Index</a>.</li>
  1183. <li>The various segmentation test data files make use of &quot;#&quot; to delimit comments,
  1184. but have distinct conventions for their data fields. See the documentation
  1185. in their header sections for details of the data field formats for
  1186. those files.</li>
  1187. <li>The XML version of the UCD has its own file format conventions.
  1188. In those files, "#" is used to stand for the code point in
  1189. algorithmically derivable character names such as CJK UNIFIED IDEOGRAPH-4E00
  1190. or TANGUT IDEOGRAPH-17000,
  1191. so as to allow for name sharing in more compact representations of the data.
  1192. See Unicode Standard Annex #42, "Unicode Character Database in XML"
  1193. [<a href="../tr41/tr41-21.html#UAX42">UAX42</a>] for details.</li>
  1194. </ul>
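<p>As an example of one of these other formats, a line of Index.txt as described above can be
read with a tab split. The function name is hypothetical, and the sketch assumes exactly two
fields per line.</p>

```python
def parse_index_line(line):
    """Return (index_entry, code_point) from one tab-delimited Index.txt line."""
    entry, code_point = line.rstrip("\r\n").split("\t")
    return entry, int(code_point, 16)
```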
  1195. <h3>4.3 <a name="File_List" href="#File_List">File List</a></h3>
  1196. <p>The exact list of files associated with any particular version of the UCD is
  1197. available on the Unicode website by referring to the component listings at
  1198. <a href="http://www.unicode.org/versions/enumeratedversions.html">Enumerated Versions</a>.</p>
  1199. <p>The majority of the data files in the UCD provide specifications of
  1200. character properties for Unicode characters. Those files and their contents
  1201. are documented in detail in the <a href="#Property_Definitions">Property Definitions</a> section
  1202. below.</p>
  1203. <p>The data files in the <i>extracted</i> subdirectory constitute reformatted listings
  1204. of single character properties extracted from UnicodeData.txt or other primary
  1205. data files. The reformatting is provided to make it easier to see the particular set
  1206. of characters having certain values for enumerated properties, or to separate
  1207. the statement of that property from other properties defined together
  1208. in UnicodeData.txt. These files also include explicit
  1209. listings of default values for the respective properties. These extracted, derived data files are further documented in
  1210. the <a href="#Derived_Extracted">Derived Extracted Properties</a> section below.</p>
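<p>The following sketch shows one way an extracted file might be read, assuming the usual semicolon-delimited UCD layout with "#" comments and a single <code>@missing</code> line that states the default value for the whole code point range. The names <code>parse_extracted</code> and <code>lookup</code> and the sample lines are hypothetical, and files that carry several <code>@missing</code> lines would need more careful handling than this.</p>

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed shape of a default-value line: "# @missing: 0000..10FFFF; <value>"
MISSING = re.compile(r"#\s*@missing:\s*([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(\S+)")

Range = Tuple[int, int, str]


def parse_extracted(lines: Iterable[str]) -> Tuple[Optional[str], List[Range]]:
    """Return (default value, list of (first, last, value)) for one property."""
    default: Optional[str] = None
    ranges: List[Range] = []
    for raw in lines:
        m = MISSING.match(raw.strip())
        if m:
            default = m.group(3)  # this sketch keeps only the last default seen
            continue
        data = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        cp_field, value = [f.strip() for f in data.split(";")[:2]]
        first, _, last = cp_field.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    return default, ranges


def lookup(cp: int, default: Optional[str], ranges: List[Range]) -> Optional[str]:
    """Return the listed value for cp, falling back to the default."""
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return default


# Illustrative sample lines, written for this sketch.
SAMPLE = [
    "# @missing: 0000..10FFFF; Unassigned",
    "0041..005A    ; Lu # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00B5          ; Ll # L&       MICRO SIGN",
]
default, ranges = parse_extracted(SAMPLE)
print(lookup(0x0041, default, ranges), lookup(0x0378, default, ranges))
```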
  1211. <p>The UCD also contains a number of test data files, whose purpose is to provide
  1212. standard test cases useful in verifying the implementation of complex Unicode
  1213. algorithms. See the <a href="#Test_Files">Test Files</a> section below for more
  1214. documentation.</p>
  1215. <p>The remaining files in the Unicode Character Database do not directly specify Unicode
  1216. properties. The important ones and their functions are listed in <i>Table 5</i>.
  1217. The Status column indicates whether the file (and its content) is considered
  1218. <b>N</b>ormative, <b>I</b>nformative, or <b>P</b>rovisional.</p>
  1219. <p class="caption">Table 5. <a name="UCD_Files_Table" href="#UCD_Files_Table">Files in the UCD</a></p>
  1220. <table class="simple">
  1221. <tr>
  1222. <th>File Name</th>
  1223. <th>Reference</th>
  1224. <th>Status</th>
  1225. <th>Description</th>
  1226. </tr>
  1227. <tr>
  1228. <td>CJKRadicals.txt</td>
  1229. <td>[<a href="../tr41/tr41-21.html#UAX38">UAX38</a>]</td>
  1230. <td style="text-align:center">I</td>
  1231. <td>List of Unified CJK Ideographs and CJK Radicals that correspond to
  1232. specific radical numbers used in the CJK radical stroke counts.</td>
  1233. </tr>
  1234. <tr>
  1235. <td>USourceData.txt</td>
  1236. <td>[<a href="../tr41/tr41-21.html#UAX45">UAX45</a>]</td>
  1237. <td style="text-align:center">N</td>
  1238. <td>The list of formal references for UTC-Source ideographs, together with data regarding
  1239. their status and sources.</td>
  1240. </tr>
  1241. <tr>
  1242. <td>USourceGlyphs.pdf</td>
  1243. <td>[<a href="../tr41/tr41-21.html#UAX45">UAX45</a>]</td>
  1244. <td style="text-align:center">I</td>
  1245. <td>A table containing a representative glyph for each UTC-Source ideograph.</td>
  1246. </tr>
  1247. <tr>
  1248. <td>TangutSources.txt</td>
  1249. <td>Chapter&nbsp;18</td>
  1250. <td style="text-align:center">N</td>
  1251. <td>Specifies normative source mappings for
  1252. Tangut ideographs and components. This data
  1253. file also includes informative radical-stroke values that are used in
  1254. the preparation of the code charts for the Tangut blocks.<br>
  1255. <b>kTGT_MergedSrc</b>: normative source mapping to various Tangut source references<br>
  1256. <b>kRSTUnicode</b>: informative radical-stroke value</td>
  1257. </tr>
  1258. <tr>
  1259. <td>NushuSources.txt</td>
  1260. <td>Chapter&nbsp;18</td>
  1261. <td style="text-align:center">N</td>
  1262. <td>Specifies normative source mappings for Nushu ideographs. This data
  1263. file also includes informative readings for Nushu characters.<br>
  1264. <b>kSrc_NushuDuben</b>: normative source mapping to the Nushu Duben<br>
  1265. <b>kReading</b>: informative example phonetic reading</td>
  1266. </tr>
  1267. <tr>
  1268. <td>EmojiSources.txt</td>
  1269. <td>Chapter&nbsp;22</td>
  1270. <td style="text-align:center">N</td>
  1271. <td>Specifies source mappings to SJIS values for emoji symbols in the original implementations
  1272. of these symbols by Japanese telecommunications companies.</td>
  1273. </tr>
  1274. <tr>
  1275. <td>Index.txt</td>
  1276. <td>Chapter&nbsp;24</td>
  1277. <td style="text-align:center">I</td>
  1278. <td>Index to Unicode characters.</td>
  1279. </tr>
  1280. <tr>
  1281. <td>NamesList.txt</td>
  1282. <td>Chapter&nbsp;24</td>
  1283. <td style="text-align:center">I</td>
  1284. <td>Names list used for production of the code charts, derived from UnicodeData.txt.
  1285. It contains additional annotations.</td>
  1286. </tr>
  1287. <tr>
  1288. <td><a href="#NamesList">NamesList.html</a></td>
  1289. <td>Chapter&nbsp;24</td>
  1290. <td style="text-align:center">I</td>
  1291. <td>Documents the format of NamesList.txt. </td>
  1292. </tr>
  1293. <tr>
  1294. <td>StandardizedVariants.txt</td>
  1295. <td>Chapter&nbsp;23</td>
  1296. <td style="text-align:center">N</td>
  1297. <td>Lists all the standardized variant sequences that have been defined, plus a textual description of
  1298. their desired appearance.</td>
  1299. </tr>
  1300. <tr>
  1301. <td><a href="#StandardizedVariants">StandardizedVariants.html</a></td>
  1302. <td>Chapter&nbsp;23</td>
  1303. <td style="text-align:center">N</td>
  1304. <td>An obsolete derived documentation file.</td>
  1305. </tr>
  1306. <tr>
  1307. <td>NamedSequences.txt</td>
  1308. <td>[<a href="../tr41/tr41-21.html#UAX34">UAX34</a>]</td>
  1309. <td style="text-align:center">N</td>
  1310. <td>Lists the names for all approved named sequences.</td>
  1311. </tr>
  1312. <tr>
  1313. <td>NamedSequencesProv.txt</td>
  1314. <td>[<a href="../tr41/tr41-21.html#UAX34">UAX34</a>]</td>
  1315. <td style="text-align:center">P</td>
  1316. <td>Lists the names for all provisional named sequences.</td>
  1317. </tr>
  1318. </table>
  1319. <p>For more information about these files and their use, see the referenced annexes or
  1320. chapters of the Unicode Standard.</p>
  1321. <h3>4.4 <a name="Zipped_Files" href="#Zipped_Files">Zipped Files</a></h3>
  1322. <p>Starting with Version 4.1.0, zipped versions of all of the UCD files,
  1323. both data files and documentation files, are available under the <i>Public/zipped</i>
  1324. directory on the Unicode website. Each collection of zipped files is located
  1325. there in a numbered subdirectory corresponding to that version of the UCD.</p>
  1326. <p>Two different zipped files are provided for each version:</p>
  1327. <ul>
  1328. <li><b>Unihan.zip</b> is the zipped version of the very large Unihan data
  1329. files.</li>
  1330. <li><b>UCD.zip</b> is the zipped
  1331. version of all of the rest of the UCD data files, excluding
  1332. the Unihan data files.</li>
  1333. </ul>
  1334. <p>This split lets users download only the version-specific data
  1335. they need: Unihan.zip contains all the pertinent CJK-related
  1336. property information, while UCD.zip contains all of the rest of the UCD
  1337. property information, for those who may not need the voluminous CJK data.</p>
  1338. <p>Starting with Version 6.1.0 the main versioned directories for the UCD also contain a copy
  1339. of UCD.zip, for convenience in access.</p>
  1340. <p>In versions of the UCD prior to Version 4.1.0, zipped copies of the
  1341. Unihan data files (which for those versions were released as a single large text file, Unihan.txt)
  1342. are provided in the same directory as the UCD data files. These zipped files are only posted
  1343. for versions of the UCD in which Unihan.txt was updated.</p>
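<p>The version thresholds above can be sketched as a small helper that lists where zipped files might be found for a given version. The base URL, the exact path layout, and the name <code>zip_locations</code> are assumptions made for illustration; the text above only states which directories hold the files, so check the website for the actual locations.</p>

```python
from typing import List, Tuple

# Assumed base URL and path layout; not stated verbatim in this annex.
BASE = "https://www.unicode.org/Public"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.1.0' into (6, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def zip_locations(version: str) -> List[str]:
    """List candidate URLs of zipped UCD files for one version."""
    v = parse_version(version)
    urls: List[str] = []
    if v >= (4, 1, 0):
        # Zipped files live in a numbered subdirectory of Public/zipped.
        urls.append(f"{BASE}/zipped/{version}/UCD.zip")
        urls.append(f"{BASE}/zipped/{version}/Unihan.zip")
    if v >= (6, 1, 0):
        # The main versioned directory also carries a copy of UCD.zip.
        urls.append(f"{BASE}/{version}/ucd/UCD.zip")
    return urls


for url in zip_locations("10.0.0"):
    print(url)
```

<p>Comparing versions as integer tuples rather than as strings keeps "10.0.0" ordered after "6.1.0". Versions before 4.1.0 yield an empty list here because their zipped Unihan files sit alongside the other data files and were only posted for some versions.</p>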
  1344. <h3>4.5 <a name="UCD_in_XML" href="#UCD_in_XML">UCD in XML</a></h3>
  1345. <p>Starting with Version 5.1.0, a set of XML data
  1346. files is also released with each version of the UCD. Those
  1347. data files make it possible to import and process the UCD property data using
  1348. standard XML parsing tools, instead of the specialized parsing required for the
  1349. various individual data files of the UCD.</p>
  1350. <h4>4.5.1 <a name="UAX42_doc" href="#UAX42_doc">UAX #42</a></h4>
  1351. <p>Unicode Standard Annex #42, "Unicode Character Database in XML" [<a href="../tr41/tr41-21.html#UAX42">UAX42</a>]
  1352. defines an XML schema
  1353. which is used to incorporate all of the Unicode character property information
  1354. into the XML version of the UCD. See that annex for details of the
  1355. schema and conventions regarding the grouping of property values for
  1356. more compact representations.</p>
  1357. <h4>4.5.2 <a name="XML_files" href="#XML_files">XML File List</a></h4>
  1358. <p>The XML version of the UCD is contained in the <i>ucdxml</i> subdirectory
  1359. of the UCD. The files are all zipped. The list of files is shown in
  1360. <i>Table 6</i>.</p>
  1361. <p class="caption">Table 6. <a name="XML_Files_Table" href="#XML_Files_Table">XML File List</a></p>
  1362. <div align="center">
  1363. <table class="simple">
  1364. <tr>
  1365. <th>File Name</th>
  1366. <th>CJK</th>
  1367. <th>non-CJK</th>
  1368. </tr>
  1369. <tr>
  1370. <td>ucd.all.flat.zip</td>
  1371. <td style="text-align:center">x</td>
  1372. <td style="text-align:center">x</td>
  1373. </tr>
  1374. <tr>
  1375. <td>ucd.all.grouped.zip</td>
  1376. <td style="text-align:center">x</td>
  1377. <td style="text-align:center">x</td>
  1378. </tr>
  1379. <tr>
  1380. <td>ucd.nounihan.flat.zip</td>
  1381. <td>&nbsp;</td>
  1382. <td style="text-align:center">x</td>
  1383. </tr>
  1384. <tr>
  1385. <td>ucd.nounihan.grouped.zip</td>
  1386. <td>&nbsp;</td>
  1387. <td style="text-align:center">x</td>
  1388. </tr>
  1389. <tr>
  1390. <td>ucd.unihan.flat.zip</td>
  1391. <td style="text-align:center">x</td>
  1392. <td>&nbsp;</td>
  1393. </tr>
  1394. <tr>
  1395. <td>ucd.unihan.grouped.zip</td>
  1396. <td style="text-align:center">x</td>
  1397. <td>&nbsp;</td>
  1398. </tr>
  1399. </table>
  1400. </div>
  1401. <p>The "flat" file versions simply list all attributes with no
  1402. particular compression. The "grouped" file versions apply the
  1403. grouping mechanism described in [<a href="../tr41/tr41-21.html#UAX42">UAX42</a>]
  1404. to cut down on the size of the data files.</p>
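<p>As a rough illustration of processing a "flat" file with a standard XML parser, the sketch below reads character names from a small inline sample and expands "#" to the code point. The namespace URI, the element name <code>char</code>, and the attribute names <code>cp</code>, <code>na</code>, and <code>gc</code> are assumptions about the [<a href="../tr41/tr41-21.html#UAX42">UAX42</a>] schema and should be checked against that annex; the sample document is made up for this sketch.</p>

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace URI assumed from UAX #42; verify against the annex.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Tiny made-up document in the assumed "flat" layout.
SAMPLE = f"""<ucd xmlns="{UCD_NS}">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char cp="4E00" na="CJK UNIFIED IDEOGRAPH-#" gc="Lo"/>
  </repertoire>
</ucd>"""


def character_names(xml_text: str) -> Dict[int, str]:
    """Map code points to names, expanding '#' to the code point."""
    root = ET.fromstring(xml_text)
    names: Dict[int, str] = {}
    for char in root.iter(f"{{{UCD_NS}}}char"):
        cp_text = char.get("cp")
        if cp_text is None:
            continue  # entries that describe ranges are skipped in this sketch
        # '#' in a name stands for the code point of the character.
        names[int(cp_text, 16)] = char.get("na", "").replace("#", cp_text)
    return names


for cp, name in sorted(character_names(SAMPLE).items()):
    print(f"U+{cp:04X} {name}")
```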
  1405. <h2>5 <a name="Properties" href="#Properties">Properties</a></h2>
  1406. <p>This section documents the Unicode character properties, relating them
  1407. in detail to the particular UCD data files in which they are specified.
  1408. For enumerated properties in particular, this section also documents the
  1409. actual values which those properties can have.</p>
  1410. <h3>5.1 <a name="Property_Index" href="#Property_Index">Property Index</a></h3>
  1411. <p><i>Table 7</i> provides a summary list of the Unicode character properties,
  1412. excluding most of those specific to the Unihan
  1413. data files. For a comparable
  1414. index of CJK character properties, see Unicode Standard Annex #38, "Unicode Han Database (Unihan)"
  1415. [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].</p>
  1416. <p>The properties are roughly organized into groups
  1417. based on their usage. This grouping is primarily for documentation convenience and,
  1418. except for <a href="#Contributory_Properties">contributory properties</a>, has no
  1419. normative implications. Contributory properties are
  1420. shown in this index with a <span class="lightgray">gray background</span>, to better distinguish them visually from
  1421. ordinary (simple or derived) properties.
  1422. Deprecated properties and other properties
  1423. not recommended for support in public <a href="#Property_APIs">property APIs</a> are also shown
  1424. with a <span class="lightgray">gray background</span>.
  1425. The link on each property leads to its
  1426. description in
  1427. <i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.
  1428. Any property marked as
  1429. <a href="#Deprecated_Properties">deprecated</a> in this index is
  1430. also automatically considered <a href="#Obsolete_Properties">obsolete</a>.</p>
  1431. <p class="caption">Table 7. <a name="Property_Index_Table" href="#Property_Index_Table">Property Index by Scope of Use</a></p>
  1432. <div align="center">
  1433. <table class="simple">
  1434. <tr>
  1435. <th width="33%">General</th>
  1436. <th width="33%">Normalization</th>
  1437. <th width="33%">CJK</th>
  1438. </tr>
  1439. <tr>
  1440. <td><a href="#Name">Name</a></td>
  1441. <td><a href="#Canonical_Combining_Class">Canonical_Combining_Class</a></td>
  1442. <td><a href="#Ideographic">Ideographic</a></td>
  1443. </tr>
  1444. <tr>
  1445. <td><a href="#Name_Alias">Name_Alias</a></td>
  1446. <td class="lightgray"><a href="#Decomposition_Mapping">Decomposition_Mapping</a></td>
  1447. <td><a href="#Unified_Ideograph">Unified_Ideograph</a></td>
  1448. </tr>
  1449. <tr>
  1450. <td><a href="#Block">Block</a></td>
  1451. <td class="lightgray"><a href="#Composition_Exclusion">Composition_Exclusion</a></td>
  1452. <td><a href="#Radical">Radical</a></td>
  1453. </tr>
  1454. <tr>
  1455. <td><a href="#Age">Age</a></td>
  1456. <td class="lightgray"><a href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a></td>
  1457. <td><a href="#IDS_Binary_Operator">IDS_Binary_Operator</a></td>
  1458. </tr>
  1459. <tr>
  1460. <td><a href="#General_Category">General_Category</a></td>
  1461. <td><a href="#Decomposition_Type">Decomposition_Type</a></td>
  1462. <td><a href="#IDS_Trinary_Operator">IDS_Trinary_Operator</a></td>
  1463. </tr>
  1464. <tr>
  1465. <td><a href="#Script">Script</a></td>
  1466. <td class="lightgray"><a href="#FC_NFKC_Closure">FC_NFKC_Closure</a> (deprecated)</td>
  1467. <td><a href="#Unicode_Radical_Stroke">Unicode_Radical_Stroke</a></td>
  1468. </tr>
  1469. <tr>
  1470. <td><a href="#Script_Extensions">Script_Extensions</a></td>
  1471. <td>&nbsp;</td>
  1472. <td>&nbsp;</td>
  1473. </tr>
  1474. <tr>
  1475. <td><a href="#White_Space">White_Space</a></td>
  1476. <td><a href="#NFC_Quick_Check">NFC_Quick_Check</a></td>
  1477. <th>Miscellaneous</th>
  1478. </tr>
  1479. <tr>
  1480. <td><a href="#Alphabetic">Alphabetic</a></td>
  1481. <td><a href="#NFKC_Quick_Check">NFKC_Quick_Check</a></td>
  1482. <td><a href="#Math">Math</a></td>
  1483. </tr>
  1484. <tr>
  1485. <td><a href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a></td>
  1486. <td><a href="#NFD_Quick_Check">NFD_Quick_Check</a></td>
  1487. <td><a href="#Quotation_Mark">Quotation_Mark</a></td>
  1488. </tr>
  1489. <tr>
  1490. <td><a href="#Noncharacter_Code_Point">Noncharacter_Code_Point</a></td>
  1491. <td><a href="#NFKD_Quick_Check">NFKD_Quick_Check</a></td>
  1492. <td><a href="#Dash">Dash</a></td>
  1493. </tr>
  1494. <tr>
  1495. <td><a href="#Default_Ignorable_Code_Point">Default_Ignorable_Code_Point</a></td>
  1496. <td class="lightgray"><a href="#Expands_On_NFC">Expands_On_NFC</a> (deprecated)</td>
  1497. <td class="lightgray"><a href="#Hyphen">Hyphen</a> (deprecated, stabilized)</td>
  1498. </tr>
  1499. <tr>
  1500. <td><a href="#Deprecated">Deprecated</a></td>
  1501. <td class="lightgray"><a href="#Expands_On_NFD">Expands_On_NFD</a> (deprecated)</td>
  1502. <td><a href="#STerm">Sentence_Terminal</a></td>
  1503. </tr>
  1504. <tr>
  1505. <td><a href="#Logical_Order_Exception">Logical_Order_Exception</a></td>
  1506. <td class="lightgray"><a href="#Expands_On_NFKC">Expands_On_NFKC</a> (deprecated)</td>
  1507. <td><a href="#Terminal_Punctuation">Terminal_Punctuation</a></td>
  1508. </tr>
  1509. <tr>
  1510. <td><a href="#Variation_Selector">Variation_Selector</a></td>
  1511. <td class="lightgray"><a href="#Expands_On_NFKD">Expands_On_NFKD</a> (deprecated)</td>
  1512. <td><a href="#Diacritic">Diacritic</a></td>
  1513. </tr>
  1514. <tr>
  1515. <th>Case</th>
  1516. <td><a href="#NFKC_Casefold">NFKC_Casefold</a></td>
  1517. <td><a href="#Extender">Extender</a></td>
  1518. </tr>
  1519. <tr>
  1520. <td><a href="#Uppercase">Uppercase</a></td>
  1521. <td><a href="#CWKCF">Changes_When_NFKC_Casefolded</a></td>
  1522. <td><a href="#Grapheme_Base">Grapheme_Base</a></td>
  1523. </tr>
  1524. <tr>
  1525. <td><a href="#Lowercase">Lowercase</a></td>
  1526. <th>Shaping and Rendering</th>
  1527. <td><a href="#Grapheme_Extend">Grapheme_Extend</a></td>
  1528. </tr>
  1529. <tr>
  1530. <td><a href="#Lowercase_Mapping">Lowercase_Mapping</a></td>
  1531. <td><a href="#Join_Control">Join_Control</a></td>
  1532. <td class="lightgray"><a href="#Grapheme_Link">Grapheme_Link</a> (deprecated)</td>
  1533. </tr>
  1534. <tr>
  1535. <td><a href="#Titlecase_Mapping">Titlecase_Mapping</a></td>
  1536. <td><a href="#Joining_Group">Joining_Group</a></td>
  1537. <td><a href="#Unicode_1_Name">Unicode_1_Name</a></td>
  1538. </tr>
  1539. <tr>
  1540. <td><a href="#Uppercase_Mapping">Uppercase_Mapping</a></td>
  1541. <td><a href="#Joining_Type">Joining_Type</a></td>
  1542. <td class="lightgray"><a href="#ISO_Comment">ISO_Comment</a> (deprecated, stabilized)</td>
  1543. </tr>
  1544. <tr>
  1545. <td>&nbsp;</td>
  1546. <td><a href="#Vertical_Orientation">Vertical_Orientation</a></td>
  1547. <td><a href="#Regional_Indicator">Regional_Indicator</a></td>
  1548. </tr>
  1549. <tr>
  1550. <td><a href="#Case_Folding">Case_Folding</a></td>
  1551. <td><a href="#Line_Break">Line_Break</a></td>
  1552. <td><a href="#Indic_Positional_Category">Indic_Positional_Category</a></td>
  1553. </tr>
  1554. <tr>
  1555. <td><a href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a></td>
  1556. <td><a href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a></td>
  1557. <td><a href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a></td>
  1558. </tr>
  1559. <tr>
  1560. <td><a href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
  1561. <td><a href="#Sentence_Break">Sentence_Break</a></td>
  1562. <th>Contributory Properties</th>
  1563. </tr>
  1564. <tr>
  1565. <td><a href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a></td>
  1566. <td><a href="#Word_Break">Word_Break</a></td>
  1567. <td class="lightgray"><a href="#Other_Alphabetic">Other_Alphabetic</a></td>
  1568. </tr>
  1569. <tr>
  1570. <td><a href="#Simple_Case_Folding">Simple_Case_Folding</a></td>
  1571. <td><a href="#East_Asian_Width">East_Asian_Width</a></td>
  1572. <td class="lightgray"><a href="#Other_Default_Ignorable_Code_Point">Other_Default_Ignorable_Code_Point</a></td>
  1573. </tr>
  1574. <tr>
  1575. <td><a href="#Soft_Dotted">Soft_Dotted</a></td>
  1576. <td><a href="#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a></td>
  1577. <td class="lightgray"><a href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></td>
  1578. </tr>
  1579. <tr>
  1580. <td><a href="#Cased">Cased</a></td>
  1581. <th>Bidirectional</th>
  1582. <td class="lightgray"><a href="#Other_ID_Start">Other_ID_Start</a></td>
  1583. </tr>
  1584. <tr>
  1585. <td><a href="#Case_Ignorable">Case_Ignorable</a></td>
  1586. <td><a href="#Bidi_Class">Bidi_Class</a></td>
  1587. <td class="lightgray"><a href="#Other_ID_Continue">Other_ID_Continue</a></td>
  1588. </tr>
  1589. <tr>
  1590. <td><a href="#CWL">Changes_When_Lowercased</a></td>
  1591. <td><a href="#Bidi_Control">Bidi_Control</a></td>
  1592. <td class="lightgray"><a href="#Other_Lowercase">Other_Lowercase</a></td>
  1593. </tr>
  1594. <tr>
  1595. <td><a href="#CWU">Changes_When_Uppercased</a></td>
  1596. <td><a href="#Bidi_Mirrored">Bidi_Mirrored</a></td>
  1597. <td class="lightgray"><a href="#Other_Math">Other_Math</a></td>
  1598. </tr>
  1599. <tr>
  1600. <td><a href="#CWT">Changes_When_Titlecased</a></td>
  1601. <td><a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a></td>
  1602. <td class="lightgray"><a href="#Other_Uppercase">Other_Uppercase</a></td>
  1603. </tr>
  1604. <tr>
  1605. <td><a href="#CWCF">Changes_When_Casefolded</a></td>
  1606. <td><a href="#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a></td>
  1607. <td class="lightgray"><a href="#Jamo_Short_Name">Jamo_Short_Name</a></td>
  1608. </tr>
  1609. <tr>
  1610. <td><a href="#CWCM">Changes_When_Casemapped</a></td>
  1611. <td><a href="#Bidi_Paired_Bracket_Type">Bidi_Paired_Bracket_Type</a></td>
  1612. <td>&nbsp;</td>
  1613. </tr>
  1614. <tr>
  1615. <th>Numeric</th>
  1616. <th>Identifiers</th>
  1617. <td>&nbsp;</td>
  1618. </tr>
  1619. <tr>
  1620. <td><a href="#Numeric_Value">Numeric_Value</a></td>
  1621. <td><a href="#ID_Continue">ID_Continue</a></td>
  1622. <td>&nbsp;</td>
  1623. </tr>
  1624. <tr>
  1625. <td><a href="#Numeric_Type">Numeric_Type</a></td>
  1626. <td><a href="#ID_Start">ID_Start</a></td>
  1627. <td>&nbsp;</td>
  1628. </tr>
  1629. <tr>
  1630. <td><a href="#Hex_Digit">Hex_Digit</a></td>
  1631. <td><a href="#XID_Continue">XID_Continue</a></td>
  1632. <td>&nbsp;</td>
  1633. </tr>
  1634. <tr>
  1635. <td><a href="#ASCII_Hex_Digit">ASCII_Hex_Digit</a></td>
  1636. <td><a href="#XID_Start">XID_Start</a></td>
  1637. <td>&nbsp;</td>
  1638. </tr>
  1639. <tr>
  1640. <td>&nbsp;</td>
  1641. <td><a href="#Pattern_Syntax">Pattern_Syntax</a></td>
  1642. <td>&nbsp;</td>
  1643. </tr>
  1644. <tr>
  1645. <td>&nbsp;</td>
  1646. <td><a href="#Pattern_White_Space">Pattern_White_Space</a></td>
  1647. <td>&nbsp;</td>
  1648. </tr>
  1649. </table>
  1650. </div>
  1651. <p>&nbsp;</p>
  1652. <h3>5.2 <a name="About_Property_Table" href="#About_Property_Table">About the Property Table</a></h3>
  1653. <p><i>Table 9, <a href="#Property_List_Table">Property Table</a></i>
  1654. specifies the list of character properties
  1655. defined in the UCD.
  1656. That table is divided into separate sections for each data
  1657. file in the UCD. Data files which define a single property or a small number of properties are listed
  1658. first, followed by the data files which define a
  1659. large number of properties: <a href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a>,
  1660. <a href="#DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a>,
  1661. <a href="#PropList.txt">PropList.txt</a>, and <a href="#UnicodeData.txt">UnicodeData.txt</a>.
  1662. In some instances for these files defining many properties, the
  1663. entries in the property table are grouped by type, for clarity in presentation, rather than
  1664. being listed alphabetically.</p>
  1665. <p>In <i>Table 9,
  1666. <a href="#Property_List_Table">Property Table</a></i> each property is described as follows:</p>
  1667. <p><b>First Column.</b> This column contains the name of each of the character properties
  1668. specified in the respective data file.
  1669. Any special status for a property, such
  1670. as whether it is <a href="#Obsolete_Properties">obsolete</a>,
  1671. <a href="#Deprecated_Properties">deprecated</a>, or
  1672. <a href="#Stabilized_Properties">stabilized</a>, is also indicated in
  1673. the first column.</p>
  1674. <p><b>Second Column.</b> This column
  1675. indicates the type of the property, according to the
  1676. key in <i>Table 8</i>.</p>
  1677. <p class="caption">Table 8. <a name="Type_Key_Table" href="#Type_Key_Table">Property Type Key</a></p>
  1678. <div align="center">
  1679. <table class="simple">
  1680. <tr>
  1681. <th>Property Type</th>
  1682. <th>Symbol</th>
  1683. <th>Examples</th>
  1684. </tr>
  1685. <tr>
  1686. <td>Catalog</td>
  1687. <td style="text-align:center">C</td>
  1688. <td>Age, Block</td>
  1689. </tr>
  1690. <tr>
  1691. <td>Enumeration</td>
  1692. <td style="text-align:center">E</td>
  1693. <td>Joining_Type, Line_Break</td>
  1694. </tr>
  1695. <tr>
  1696. <td>Binary</td>
  1697. <td style="text-align:center">B</td>
  1698. <td>Uppercase, White_Space</td>
  1699. </tr>
  1700. <tr>
  1701. <td>String</td>
  1702. <td style="text-align:center">S</td>
  1703. <td>Uppercase_Mapping, Case_Folding</td>
  1704. </tr>
  1705. <tr>
  1706. <td>Numeric</td>
  1707. <td style="text-align:center">N</td>
  1708. <td>Numeric_Value</td>
  1709. </tr>
  1710. <tr>
  1711. <td>Miscellaneous</td>
  1712. <td style="text-align:center">M</td>
  1713. <td>Name, Jamo_Short_Name</td>
  1714. </tr>
  1715. </table>
  1716. </div>
  1717. <ul>
  1718. <li><b>Catalog</b> properties have enumerated values which are expected
  1719. to be regularly extended in successive versions of the Unicode Standard. This distinguishes them
  1720. from Enumeration properties.</li>
  1721. <li><b>Enumeration</b> properties have enumerated values
  1722. which constitute a logical partition space;
  1723. new values will generally not be added to them in successive versions of the standard.</li>
  1724. <li><b>Binary</b> properties are a special case of Enumeration properties, which
  1725. have exactly two values: Yes and No (or True and False).</li>
  1726. <li><b>String</b> properties
  1727. are typically mappings from a Unicode code point to another Unicode code point
  1728. or sequence of Unicode code points; examples include case mappings and
  1729. decomposition mappings.</li>
  1730. <li><b>Numeric</b> properties specify the actual numeric values
  1731. for digits and other characters associated with numbers in some way.</li>
  1732. <li><b>Miscellaneous</b> properties are those properties that do not fit neatly into the other
  1733. property categories; they currently include character names, comments about characters,
  1734. the <a href="#Script_Extensions">Script_Extensions</a> property,
  1735. and the Unicode_Radical_Stroke property (a combination of numeric values)
  1736. documented in Unicode Standard Annex #38, "Unicode Han Database (Unihan)"
  1737. [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].</li>
  1738. </ul>
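<p>An implementation that tracks property types might model the key in <i>Table 8</i> as an enumeration. The sketch below is one possible representation; the class name <code>PropertyType</code> and the dictionary <code>EXAMPLES</code> are hypothetical, and the entries are taken only from the examples column of <i>Table 8</i>.</p>

```python
from enum import Enum


class PropertyType(Enum):
    """Property types keyed by the one-letter symbols of Table 8."""
    CATALOG = "C"
    ENUMERATION = "E"
    BINARY = "B"
    STRING = "S"
    NUMERIC = "N"
    MISCELLANEOUS = "M"


# Example properties as listed in Table 8.
EXAMPLES = {
    "Age": PropertyType.CATALOG,
    "Block": PropertyType.CATALOG,
    "Joining_Type": PropertyType.ENUMERATION,
    "Line_Break": PropertyType.ENUMERATION,
    "Uppercase": PropertyType.BINARY,
    "White_Space": PropertyType.BINARY,
    "Uppercase_Mapping": PropertyType.STRING,
    "Case_Folding": PropertyType.STRING,
    "Numeric_Value": PropertyType.NUMERIC,
    "Name": PropertyType.MISCELLANEOUS,
    "Jamo_Short_Name": PropertyType.MISCELLANEOUS,
}

# Looking up a type from the symbol used in the second column of Table 9.
print(PropertyType("E").name)
```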
  1739. <p><b>Third Column.</b> This column indicates the
  1740. status of the property: <b>N</b>ormative, <b>I</b>nformative, <b>C</b>ontributory,
  1741. or <b>P</b>rovisional.</p>
  1742. <p><b>Fourth Column.</b> This column provides a description of
  1743. the property or properties. This includes information on derivation for
  1744. derived properties, as well as references to locations in the standard
  1745. where the property is defined or discussed in detail.</p>
  1746. <p>In the section of the table for <a href="#UnicodeData.txt">UnicodeData.txt</a>,
  1747. the data field numbers are also supplied in parentheses at the
  1748. start of the description.</p>
  1749. <p>For a few entries in the property table, values specified in the fields in a
  1750. data file only contribute to a full definition of a Unicode character property.
  1751. For example, the values in field 1 (Name) in
  1752. UnicodeData.txt do not provide all the values for the Name
  1753. property for all code points; <a href="#Jamo.txt">Jamo.txt</a> must also be used,
  1754. and the Name property for CJK unified ideographs, Tangut ideographs,
  1755. and Nushu ideographs is derived by rule.</p>
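<p>Names derived by rule, such as CJK UNIFIED IDEOGRAPH-4E00 or TANGUT IDEOGRAPH-17000, can be produced by appending the hexadecimal code point to a fixed prefix. The sketch below shows the idea only: the ranges in <code>RULE_RANGES</code> are an illustrative, incomplete assumption (the ranges covered depend on the version of the standard), and the names <code>RULE_RANGES</code> and <code>derived_name</code> are hypothetical.</p>

```python
from typing import Optional

# Illustrative (first, last, prefix) entries. These ranges are assumptions
# for the sketch and are not a complete or version-exact list.
RULE_RANGES = [
    (0x4E00, 0x9FA5, "CJK UNIFIED IDEOGRAPH-"),
    (0x17000, 0x187EC, "TANGUT IDEOGRAPH-"),
    (0x1B170, 0x1B2FB, "NUSHU CHARACTER-"),
]


def derived_name(cp: int) -> Optional[str]:
    """Return the rule-derived name for cp, or None if no rule applies here."""
    for first, last, prefix in RULE_RANGES:
        if first <= cp <= last:
            # The code point is written in uppercase hex, at least four digits.
            return f"{prefix}{cp:04X}"
    return None  # the name would come from UnicodeData.txt or another rule


print(derived_name(0x4E00))
print(derived_name(0x17000))
```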
  1756. <p>None of the Unicode character properties should be used simply on the
  1757. basis of the descriptions in the property table without consulting the relevant
  1758. discussions in the Unicode Standard. Because of the enormous variety of
  1759. characters in the repertoire of the Unicode Standard, character properties
  1760. tend not to be self-evident in application, even when the names of the
  1761. properties may seem familiar from their usage with much smaller legacy
  1762. character encodings.</p>
  1763. <h3>5.3 <a name="Property_Definitions" href="#Property_Definitions">Property Definitions</a></h3>
  1764. <p>This section contains the table which describes each character property and defines its status, organized by data file in the UCD.
  1765. <i>Table 9</i> provides general descriptions of the Unicode character properties, their derivations,
  1766. and/or their usage, as well as pointers to the respective parts of the standard where formal property definitions or additional
  1767. information about the properties can be found. The property status column and any formal statement of the derivation
  1768. of derived properties are definitive; however, <i>Table 9</i> does not provide formal definitions of the other properties
  1769. and should not be interpreted as such. For details on the columns and overall organization of the table, see
  1770. Section 5.2 <a href="#About_Property_Table">About the Property Table</a>.</p>
  1771. <p class="caption">Table 9. <a name="Property_List_Table" href="#Property_List_Table">Property Table</a></p>
  1772. <table class="simple">
  1773. <tr>
  1774. <th valign="top" align="LEFT" colspan="4">
  1775. <a name="ArabicShaping.txt" href="#ArabicShaping.txt">ArabicShaping.txt</a></th>
  1776. </tr>
  1777. <tr>
  1778. <td><a name="Joining_Type" href="#Joining_Type">Joining_Type</a><br>
  1779. <a name="Joining_Group" href="#Joining_Group">Joining_Group</a></td>
  1780. <td>E</td>
  1781. <td valign="top">N</td>
  1782. <td>Basic Arabic and Syriac character shaping properties, such as initial, medial and final
  1783. shapes. See <i>Section 9.2, Arabic</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  1784. </td>
  1785. </tr>
  1786. <tr>
  1787. <th valign="top" align="LEFT" colspan="4">
  1788. <a name="BidiBrackets.txt" href="#BidiBrackets.txt">BidiBrackets.txt</a></th>
  1789. </tr>
  1790. <tr>
  1791. <td><a name="Bidi_Paired_Bracket_Type" href="#Bidi_Paired_Bracket_Type">Bidi_Paired_Bracket_Type</a></td>
  1792. <td>E</td>
  1793. <td valign="top">N</td>
  1794. <td>Type of a paired bracket, either opening or closing. This property is used in the implementation
  1795. of parenthesis matching.
  1796. See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>].</td>
  1797. </tr>
  1798. <tr>
  1799. <td><a name="Bidi_Paired_Bracket" href="#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a></td>
  1800. <td>M</td>
  1801. <td valign="top">N</td>
  1802. <td>For an opening bracket, the code point of the matching closing bracket. For a closing bracket, the
  1803. code point of the matching opening bracket. This property is used in the implementation
  1804. of parenthesis matching.
  1805. See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>].</td>
  1806. </tr>
  1807. <tr>
  1808. <th valign="top" align="LEFT" colspan="4">
  1809. <a name="BidiMirroring.txt" href="#BidiMirroring.txt">BidiMirroring.txt</a></th>
  1810. </tr>
  1811. <tr>
  1812. <td><a name="Bidi_Mirroring_Glyph" href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a></td>
  1813. <td>M</td>
  1814. <td valign="top">I</td>
  1815. <td>Informative mapping for substituting characters in an implementation of bidirectional mirroring.
  1816. This maps a subset of characters with the Bidi_Mirrored property to other
  1817. characters that normally are displayed with the corresponding mirrored glyph.
  1818. When a character with the Bidi_Mirrored property has
  1819. the default value for Bidi_Mirroring_Glyph, that means that no other character
  1820. exists whose glyph is appropriate for character-based glyph mirroring.
  1821. Implementations must then use other mechanisms to implement mirroring of those
  1822. characters for the Unicode Bidirectional Algorithm.
  1823. See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>]. Do not
  1824. confuse this property with the <a href="#Bidi_Mirrored">Bidi_Mirrored</a> property itself.</td>
  1825. </tr>
  1826. <tr>
  1827. <th valign="top" align="LEFT" colspan="4">
  1828. <a name="Blocks.txt" href="#Blocks.txt">Blocks.txt</a></th>
  1829. </tr>
  1830. <tr>
  1831. <td><a name="Block" href="#Block">Block</a></td>
  1832. <td>C</td>
  1833. <td valign="top">N</td>
  1834. <td>Blocks.txt specifies the Block property, which consists
  1835. of the list of block names
  1836. for ranges of code points. See
  1837. D10b in <i>Section 3.4, Characters and Encoding</i>, of
  1838. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]. See also
  1839. the code charts in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</td>
  1840. </tr>
  1841. <tr>
  1842. <th valign="top" align="LEFT" colspan="4">
  1843. <a name="CompositionExclusions.txt" href="#CompositionExclusions.txt">CompositionExclusions.txt</a></th>
  1844. </tr>
  1845. <tr>
  1846. <td><a name="Composition_Exclusion" href="#Composition_Exclusion">Composition_Exclusion</a></td>
  1847. <td>B</td>
  1848. <td valign="top">N</td>
  1849. <td>
  1850. A property used in normalization. See Unicode Standard Annex #15, "Unicode Normalization Forms" [<a href="../tr41/tr41-21.html#UAX15">UAX15</a>].
  1851. Unlike other files, CompositionExclusions.txt simply lists the relevant code points.</td>
  1852. </tr>
  1853. <tr>
  1854. <th valign="top" align="LEFT" colspan="4">
  1855. <a name="CaseFolding.txt" href="#CaseFolding.txt">CaseFolding.txt</a></th>
  1856. </tr>
  1857. <tr>
  1858. <td><a name="Simple_Case_Folding" href="#Simple_Case_Folding">Simple_Case_Folding</a><br>
  1859. <a name="Case_Folding" href="#Case_Folding">Case_Folding</a></td>
  1860. <td>S</td>
  1861. <td valign="top">N</td>
  1862. <td>Mapping from characters to their case-folded forms. This is an informative file containing
  1863. normative derived properties.
  1864. <p><i>Derived from UnicodeData and SpecialCasing.</i>
  1865. <p><b>Note: </b>The case foldings are omitted in the data file if they are
  1866. the same as the code point itself.</td>
  1867. </tr>
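<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The note about omitted entries means a lookup has to default to the code point itself. The sketch below assumes the <code>code; status; mapping; # name</code> layout of CaseFolding.txt, with status C (common), F (full) and S (simple); the function names are hypothetical and the sample holds only a few entries.</p>

```python
def parse_case_folding(lines, full=True):
    """Build {code point: folded string} from CaseFolding.txt-style lines.

    Full folding uses statuses C and F; simple folding uses C and S.
    """
    wanted = {"C", "F"} if full else {"C", "S"}
    folding = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in wanted:
            folding[int(code, 16)] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return folding


SAMPLE = [
    "0041; C; 0061; # LATIN CAPITAL LETTER A",
    "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
    "1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S",
    "1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S",
]


def fold(text, folding):
    """Fold each character; code points absent from the data fold to themselves."""
    return "".join(folding.get(ord(ch), ch) for ch in text)


FULL = parse_case_folding(SAMPLE, full=True)
SIMPLE = parse_case_folding(SAMPLE, full=False)
print(fold("A\u00DFz", FULL))
print(fold("\u1E9E", SIMPLE))
```

<p>For comparison, Python's built-in <code>str.casefold()</code> applies full case folding using the Unicode data bundled with the interpreter.</p>
</td>
</tr>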
  1868. <tr>
  1869. <th valign="top" align="LEFT" colspan="4">
  1870. <a name="DerivedAge.txt" href="#DerivedAge.txt">DerivedAge.txt</a></th>
  1871. </tr>
  1872. <tr>
  1873. <td><a name="Age" href="#Age">Age</a></td>
  1874. <td>C</td>
  1875. <td valign="top">N</td>
  1876. <td>A property defining when various code points were designated/assigned in successive versions
  1877. of the Unicode Standard.
  1878. For a detailed discussion of the Age property, see
  1879. Section 5.14, <a href="#Character_Age"><i>Character Age</i></a>.
  1880. </td>
  1881. </tr>
  1882. <tr>
  1883. <th valign="top" align="LEFT" colspan="4">
  1884. <a name="EastAsianWidth.txt" href="#EastAsianWidth.txt">EastAsianWidth.txt</a></th>
  1885. </tr>
  1886. <tr>
  1887. <td><a name="East_Asian_Width" href="#East_Asian_Width">East_Asian_Width</a></td>
  1888. <td>E</td>
  1889. <td valign="top">I</td>
  1890. <td>A property
  1891. for determining the choice of wide versus narrow glyphs in East Asian contexts.
  1892. Property values are described in Unicode Standard Annex #11, "East Asian Width" [<a href="../tr41/tr41-21.html#UAX11">UAX11</a>].</td>
  1893. </tr>
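<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> Python's standard <code>unicodedata.east_asian_width()</code> returns the East_Asian_Width value of a character. The helper <code>display_columns</code> is hypothetical and deliberately crude: it counts Fullwidth (F) and Wide (W) characters as two columns and everything else as one, ignoring combining marks and the context-dependent Ambiguous (A) value.</p>

```python
import unicodedata


def display_columns(text):
    """Rough column count: F and W take two columns, all others one."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )


# LATIN CAPITAL LETTER A, FULLWIDTH LATIN CAPITAL LETTER A,
# a CJK ideograph, and HALFWIDTH KATAKANA LETTER KA.
for ch in ("A", "\uFF21", "\u4E2D", "\uFF76"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

print(display_columns("A\u4E2D"))
```
</td>
</tr>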
  1894. <tr>
  1895. <th valign="top" align="LEFT" colspan="4">
  1896. <a name="HangulSyllableType.txt" href="#HangulSyllableType.txt">HangulSyllableType.txt</a></th>
  1897. </tr>
  1898. <tr>
  1899. <td valign="top"><a name="Hangul_Syllable_Type" href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a></td>
  1900. <td valign="top" align="center">E</td>
  1901. <td valign="top" align="center">N</td>
  1902. <td valign="top">The values L, V, T, LV, and LVT used in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</td>
  1903. </tr>
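<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> For precomposed syllables (U+AC00..U+D7A3), the LV and LVT values follow from the arithmetic layout of the block: every 28th syllable has no trailing consonant. The constant names mirror those commonly used for Hangul arithmetic; the function name is hypothetical. The L, V and T values of conjoining jamo are not computed here and would come from the data file.</p>

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed syllables
T_COUNT = 28      # trailing-consonant slots per LV pair (including "none")


def precomposed_syllable_type(cp):
    """Return 'LV' or 'LVT' for a precomposed Hangul syllable, else None."""
    index = cp - S_BASE
    if index not in range(S_COUNT):
        return None
    return "LV" if index % T_COUNT == 0 else "LVT"


for cp in (0xAC00, 0xAC01, 0xD7A3, 0x1100):
    print(f"U+{cp:04X}", precomposed_syllable_type(cp))
```
</td>
</tr>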
  1904. <tr>
  1905. <th valign="top" align="LEFT" colspan="4">
  1906. <a name="IndicPositionalCategory.txt" href="#IndicPositionalCategory.txt">IndicPositionalCategory.txt</a></th>
  1907. </tr>
  1908. <tr>
  1909. <td valign="top"><a name="Indic_Matra_Category"></a>
  1910. <a name="Indic_Positional_Category" href="#Indic_Positional_Category">Indic_Positional_Category</a></td>
  1911. <td valign="top" align="center">E</td>
  1912. <td valign="top" align="center">I</td>
  1913. <td valign="top">A property informally defining the
  1914. positional categories
  1915. for dependent vowels, viramas, combining marks, and other characters used in Indic scripts.
  1916. General descriptions of the property values are provided in the header section
  1917. of the data file IndicPositionalCategory.txt.</td>
  1918. </tr>
  1919. <tr>
  1920. <th valign="top" align="LEFT" colspan="4">
  1921. <a name="IndicSyllabicCategory.txt" href="#IndicSyllabicCategory.txt">IndicSyllabicCategory.txt</a></th>
  1922. </tr>
  1923. <tr>
  1924. <td valign="top"><a name="Indic_Syllabic_Category" href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a></td>
  1925. <td valign="top" align="center">E</td>
  1926. <td valign="top" align="center">I</td>
  1927. <td valign="top">A property informally defining the structural categories
  1928. of syllabic components in Indic scripts.
  1929. General descriptions of the property values are provided in the header section
  1930. of the data file IndicSyllabicCategory.txt.</td>
  1931. </tr>
  1932. <tr>
  1933. <th valign="top" align="LEFT" colspan="4">
  1934. <a name="Jamo.txt" href="#Jamo.txt">Jamo.txt</a></th>
  1935. </tr>
  1936. <tr>
  1937. <td valign="top"><a name="Jamo_Short_Name" href="#Jamo_Short_Name">Jamo_Short_Name</a></td>
  1938. <td valign="top" align="center">M</td>
  1939. <td valign="top" align="center">C</td>
  1940. <td valign="top">The Hangul Syllable names are derived from the Jamo Short
  1941. Names, as described in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</td>
  1942. </tr>
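<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The derivation of Hangul syllable names can be sketched by decomposing the syllable index arithmetically and concatenating the Jamo_Short_Name values. The three tables below are transcribed by hand in code point order for the leading consonants, vowels and trailing consonants, so treat them as an assumption to be checked against Jamo.txt; the function name is hypothetical.</p>

```python
import unicodedata

S_BASE = 0xAC00
S_COUNT = 11172
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT  # syllables per leading consonant

JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Derive the name of a precomposed Hangul syllable from short names."""
    index = cp - S_BASE
    if index not in range(S_COUNT):
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


print(hangul_syllable_name(0xD4DB))
# Cross-check against the names bundled with the interpreter.
print(hangul_syllable_name(0xAC00) == unicodedata.name("\uAC00"))
```
</td>
</tr>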
  1943. <tr>
  1944. <th valign="top" align="LEFT" colspan="4">
  1945. <a name="LineBreak.txt" href="#LineBreak.txt">LineBreak.txt</a></th>
  1946. </tr>
  1947. <tr>
  1948. <td><a name="Line_Break" href="#Line_Break">Line_Break</a></td>
  1949. <td>E</td>
  1950. <td valign="top">N</td>
  1951. <td>A property
  1952. for line breaking. For more information, see Unicode Standard Annex #14, "Unicode Line Breaking
  1953. Algorithm" [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>].</td>
  1954. </tr>
  1955. <tr>
  1956. <th valign="top" align="LEFT" colspan="4">
  1957. <a name="GraphemeBreakProperty.txt" href="#GraphemeBreakProperty.txt">GraphemeBreakProperty.txt</a></th>
  1958. </tr>
  1959. <tr>
  1960. <td><a name="Grapheme_Cluster_Break" href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a></td>
  1961. <td>E</td>
  1962. <td valign="top">I</td>
<td>A property for determining grapheme cluster boundaries. See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</td>
  1964. </tr>
  1965. <tr>
  1966. <th valign="top" align="LEFT" colspan="4">
  1967. <a name="SentenceBreakProperty.txt" href="#SentenceBreakProperty.txt">SentenceBreakProperty.txt</a></th>
  1968. </tr>
  1969. <tr>
  1970. <td><a name="Sentence_Break" href="#Sentence_Break">Sentence_Break</a></td>
  1971. <td>E</td>
  1972. <td valign="top">I</td>
<td>A property for determining sentence boundaries. See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</td>
  1974. </tr>
  1975. <tr>
  1976. <th valign="top" align="LEFT" colspan="4">
  1977. <a name="WordBreakProperty.txt" href="#WordBreakProperty.txt">WordBreakProperty.txt</a></th>
  1978. </tr>
  1979. <tr>
  1980. <td><a name="Word_Break" href="#Word_Break">Word_Break</a></td>
  1981. <td>E</td>
  1982. <td valign="top">I</td>
<td>A property for determining word boundaries. See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</td>
  1984. </tr>
  1985. <tr>
  1986. <th valign="top" align="LEFT" colspan="4">
  1987. <a name="NameAliases.txt" href="#NameAliases.txt">NameAliases.txt</a></th>
  1988. </tr>
  1989. <tr>
  1990. <td valign="top"><a name="Name_Alias" href="#Name_Alias">Name_Alias</a></td>
  1991. <td valign="top" align="center">M</td>
  1992. <td valign="top" align="center">N</td>
  1993. <td valign="top">Normative formal aliases for characters with erroneous
  1994. names, for control characters and some format characters,
  1995. and for character abbreviations, as described in <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  1996. Aliases tagged with the type "correction", as well as a selection of aliases of other types, are
  1997. published in the Unicode Standard code charts.</td>
  1998. </tr>
  1999. <tr>
  2000. <th valign="top" align="LEFT" colspan="4">
  2001. <a name="NormalizationCorrections.txt" href="#NormalizationCorrections.txt">NormalizationCorrections.txt</a></th>
  2002. </tr>
  2003. <tr>
  2004. <td valign="top"><i>used in Decomposition Mappings</i></td>
  2005. <td valign="top" align="center">S</td>
  2006. <td valign="top" align="center">N</td>
<td valign="top">NormalizationCorrections.txt lists code point differences for
<i><a href="http://www.unicode.org/versions/corrigenda.html">Normalization Corrigenda</a></i>.
  2009. For more information, see Unicode Standard Annex #15, "Unicode Normalization Forms"
  2010. [<a href="../tr41/tr41-21.html#UAX15">UAX15</a>].</td>
  2011. </tr>
  2012. <tr>
  2013. <th valign="top" align="LEFT" colspan="4">
  2014. <a name="Scripts.txt" href="#Scripts.txt">Scripts.txt</a></th>
  2015. </tr>
  2016. <tr>
  2017. <td><a name="Script" href="#Script">Script</a></td>
  2018. <td>C</td>
  2019. <td valign="top">I</td>
  2020. <td>Script values for use in regular expressions and elsewhere.
  2021. For more information, see Unicode Standard Annex
  2022. #24, "Unicode Script Property" [<a href="../tr41/tr41-21.html#UAX24">UAX24</a>].</td>
  2023. </tr>
  2024. <tr>
  2025. <th valign="top" align="LEFT" colspan="4">
  2026. <a name="ScriptExtensions.txt" href="#ScriptExtensions.txt">ScriptExtensions.txt</a></th>
  2027. </tr>
  2028. <tr>
  2029. <td><a name="Script_Extensions" href="#Script_Extensions">Script_Extensions</a></td>
  2030. <td>M</td>
  2031. <td valign="top">I</td>
  2032. <td>Enumerated sets of Script values for use in regular expressions and elsewhere.
  2033. For more information, see Unicode Standard Annex
  2034. #24, "Unicode Script Property" [<a href="../tr41/tr41-21.html#UAX24">UAX24</a>].</td>
  2035. </tr>
  2036. <tr>
  2037. <th valign="top" align="LEFT" colspan="4">
  2038. <a name="SpecialCasing.txt" href="#SpecialCasing.txt">SpecialCasing.txt</a></th>
  2039. </tr>
  2040. <tr>
<td><a name="Uppercase_Mapping" href="#Uppercase_Mapping">Uppercase_Mapping</a><br>
<a name="Lowercase_Mapping" href="#Lowercase_Mapping">Lowercase_Mapping</a><br>
  2043. <a name="Titlecase_Mapping" href="#Titlecase_Mapping">Titlecase_Mapping</a><br>
  2044. </td>
  2045. <td>S</td>
  2046. <td valign="top">I</td>
  2047. <td>Data for producing (in combination with the simple case mappings
  2048. from <a href="#UnicodeData.txt">UnicodeData.txt</a>) the full case mappings.</td>
  2049. </tr>
  2050. <tr>
  2051. <th valign="top" align="LEFT" colspan="4">
  2052. <a name="Unihan.txt" href="#Unihan.txt">Unihan</a> data files (for more
  2053. information, see [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>])</th>
  2054. </tr>
  2055. <tr>
  2056. <td><a name="Numeric_Type_Han" href="#Numeric_Type_Han">Numeric_Type</a><br>
  2057. <a name="Numeric_Value_Han" href="#Numeric_Value_Han">Numeric_Value</a></td>
  2058. <td>E</td>
  2059. <td valign="top">I</td>
  2060. <td>The characters tagged with either kPrimaryNumeric,
  2061. kAccountingNumeric, or kOtherNumeric are given the property value
  2062. Numeric_Type=Numeric, and the Numeric_Value indicated
  2063. in those tags.
  2064. <p>Most characters have these numeric properties based on values from UnicodeData.txt.
  2065. See <a href="#Numeric_Type">Numeric_Type</a>.</td>
  2066. </tr>
  2067. <tr>
  2068. <td><a name="Unicode_Radical_Stroke" href="#Unicode_Radical_Stroke">Unicode_Radical_Stroke</a></td>
  2069. <td>M</td>
  2070. <td valign="top">I</td>
  2071. <td>The Unicode radical-stroke count, based on the tag
  2072. kRSUnicode.</td>
  2073. </tr>
  2074. <tr>
  2075. <th valign="top" align="LEFT" colspan="4">
  2076. <a name="DerivedCoreProperties.txt" href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a></th>
  2077. </tr>
  2078. <tr>
  2079. <td valign="top" align="left"><a name="Lowercase" href="#Lowercase">Lowercase</a></td>
  2080. <td valign="top">B</td>
  2081. <td valign="top">I</td>
  2082. <td valign="top">Characters with the Lowercase property. For more information, see
  2083. <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].<p><i>Generated from: Ll + <a href="#Other_Lowercase">Other_Lowercase</a></i></td>
  2084. </tr>
  2085. <tr>
  2086. <td valign="top" align="left"><a name="Uppercase" href="#Uppercase">Uppercase</a></td>
  2087. <td valign="top">B</td>
  2088. <td valign="top">I</td>
  2089. <td valign="top">Characters with the Uppercase property. For more information, see
  2090. <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].<p><i>Generated from: Lu + <a href="#Other_Uppercase">Other_Uppercase</a></i></td>
  2091. </tr>
  2092. <tr>
  2093. <td valign="top" align="left"><a name="Cased" href="#Cased">Cased</a></td>
  2094. <td valign="top">B</td>
  2095. <td valign="top">I</td>
  2096. <td valign="top">Characters which are considered to be either uppercase, lowercase
  2097. or titlecase characters. This property is not identical to the
  2098. Changes_When_Casemapped property. For more information, see D135 in <i>Section 3.13, Default Case
  2099. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2100. <p><i>Generated from: <a href="#Lowercase">Lowercase</a> + <a href="#Uppercase">Uppercase</a> + Lt</i></td>
  2101. </tr>
  2102. <tr>
  2103. <td valign="top" align="left"><a name="Case_Ignorable" href="#Case_Ignorable">Case_Ignorable</a></td>
  2104. <td valign="top">B</td>
  2105. <td valign="top">I</td>
  2106. <td valign="top">Characters which are ignored for casing purposes. For more
  2107. information, see D136 in <i>Section 3.13, Default Case
  2108. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2109. <p><i>Generated from: Mn + Me + Cf + Lm + Sk + <a href="#Word_Break">Word_Break</a>=MidLetter +
  2110. <a href="#Word_Break">Word_Break</a>=MidNumLet + <a href="#Word_Break">Word_Break</a>=Single_Quote</i></td>
  2111. </tr>
  2112. <tr>
  2113. <td valign="top" align="left"><a name="CWL" href="#CWL">Changes_When_Lowercased</a></td>
  2114. <td valign="top">B</td>
  2115. <td valign="top">I</td>
  2116. <td valign="top">Characters whose normalized forms are not stable under a toLowercase
  2117. mapping. For more information, see D139 in <i>Section 3.13, Default Case
  2118. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2119. <p><i>Generated from: toLowercase(toNFD(X)) != toNFD(X)</i></td>
  2120. </tr>
  2121. <tr>
  2122. <td valign="top" align="left"><a name="CWU" href="#CWU">Changes_When_Uppercased</a></td>
  2123. <td valign="top">B</td>
  2124. <td valign="top">I</td>
  2125. <td valign="top">Characters whose normalized forms are not stable under a toUppercase
  2126. mapping. For more information, see D140 in <i>Section 3.13, Default Case
  2127. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2128. <p><i>Generated from: toUppercase(toNFD(X)) != toNFD(X)</i></td>
  2129. </tr>
  2130. <tr>
  2131. <td valign="top" align="left"><a name="CWT" href="#CWT">Changes_When_Titlecased</a></td>
  2132. <td valign="top">B</td>
  2133. <td valign="top">I</td>
  2134. <td valign="top">Characters whose normalized forms are not stable under a toTitlecase
  2135. mapping. For more information, see D141 in <i>Section 3.13, Default Case
  2136. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2137. <p><i>Generated from: toTitlecase(toNFD(X)) != toNFD(X)</i></td>
  2138. </tr>
  2139. <tr>
  2140. <td valign="top" align="left"><a name="CWCF" href="#CWCF">Changes_When_Casefolded</a></td>
  2141. <td valign="top">B</td>
  2142. <td valign="top">I</td>
  2143. <td valign="top">Characters whose normalized forms are not stable under case
  2144. folding. For more information, see D142 in <i>Section 3.13, Default Case
  2145. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2146. <p><i>Generated from: toCasefold(toNFD(X)) != toNFD(X)</i></td>
  2147. </tr>
  2148. <tr>
  2149. <td valign="top" align="left"><a name="CWCM" href="#CWCM">Changes_When_Casemapped</a></td>
  2150. <td valign="top">B</td>
  2151. <td valign="top">I</td>
  2152. <td valign="top">Characters which may change when they undergo case mapping.
  2153. For more information, see D143 in <i>Section 3.13, Default Case
  2154. Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2155. <p><i>Generated from: Changes_When_Lowercased(X) or Changes_When_Uppercased(X) or
  2156. Changes_When_Titlecased(X)</i></td>
  2157. </tr>
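<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The "Generated from" expressions for the Changes_When_* properties can be approximated with Python string methods. This assumes that <code>str.lower()</code>, <code>str.upper()</code>, <code>str.title()</code> and <code>str.casefold()</code> are acceptable stand-ins for toLowercase, toUppercase, toTitlecase and toCasefold; <code>str.title()</code> in particular applies its own word segmentation, and all results depend on the Unicode version bundled with the interpreter. The function names are hypothetical.</p>

```python
import unicodedata


def _nfd(ch):
    return unicodedata.normalize("NFD", ch)


def changes_when_lowercased(ch):
    d = _nfd(ch)
    return d.lower() != d


def changes_when_uppercased(ch):
    d = _nfd(ch)
    return d.upper() != d


def changes_when_titlecased(ch):
    d = _nfd(ch)
    return d.title() != d


def changes_when_casefolded(ch):
    d = _nfd(ch)
    return d.casefold() != d


def changes_when_casemapped(ch):
    # Union of the three case-mapping properties, as in the derivation.
    return (changes_when_lowercased(ch)
            or changes_when_uppercased(ch)
            or changes_when_titlecased(ch))


for ch in ("A", "a", "\u00DF", "1"):
    print(repr(ch), changes_when_lowercased(ch), changes_when_uppercased(ch),
          changes_when_casefolded(ch), changes_when_casemapped(ch))
```
</td>
</tr>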
  2158. <tr>
  2159. <td valign="top" align="left"><a name="Alphabetic" href="#Alphabetic">Alphabetic</a></td>
  2160. <td valign="top">B</td>
  2161. <td valign="top">I</td>
  2162. <td valign="top">Characters with the Alphabetic property. For more information, see
  2163. <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2164. <p><i>Generated from:
  2165. <a href="#Lowercase">Lowercase</a> + <a href="#Uppercase">Uppercase</a> + Lt + Lm +
  2166. Lo + Nl + <a href="#Other_Alphabetic">Other_Alphabetic</a></i></td>
  2167. </tr>
  2168. <tr>
  2169. <td valign="top" align="left"><a name="Default_Ignorable_Code_Point" href="#Default_Ignorable_Code_Point">
  2170. Default_Ignorable_Code_Point</a></td>
  2171. <td valign="top">B</td>
  2172. <td valign="top">N</td>
  2173. <td valign="top">For programmatic determination of default ignorable code points. New
  2174. characters that should be ignored in rendering (unless explicitly supported) will be assigned
  2175. in these ranges, permitting programs to correctly handle the default rendering of such
  2176. characters when not otherwise supported. For more information, see the FAQ
  2177. <a href="http://www.unicode.org/faq/unsup_char.html">Display of Unsupported Characters</a>,
  2178. and <i>Section 5.21, Ignoring Characters in Processing</i>
  2179. in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
<p><i>Generated from:<br>
  2181. <a href="#Other_Default_Ignorable_Code_Point">Other_Default_Ignorable_Code_Point</a><br>
  2182. + Cf (format characters)<br>
  2183. + Variation_Selector<br>
  2184. - White_Space<br>
  2185. - FFF9..FFFB (annotation characters)<br>
  2186. - 0600..0605, 06DD, 070F, 08E2, 110BD (exceptional Cf characters that should be visible)</i></td>
  2187. </tr>
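<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The derivation of Default_Ignorable_Code_Point is plain set arithmetic, applied left to right. Python's <code>unicodedata</code> module exposes General_Category but not the contributory properties, so Other_Default_Ignorable_Code_Point, Variation_Selector and White_Space are passed in as caller-supplied sets (in practice parsed from PropList.txt). The function name and the tiny sample inputs are hypothetical.</p>

```python
import unicodedata

# Exceptions named in the derivation.
ANNOTATION_CHARACTERS = set(range(0xFFF9, 0xFFFC))  # FFF9..FFFB
VISIBLE_CF = set(range(0x0600, 0x0606)) | {0x06DD, 0x070F, 0x08E2, 0x110BD}


def derive_default_ignorable(other_dicp, variation_selectors, white_space,
                             code_points=range(0x110000)):
    """Apply the set arithmetic of the derivation to the given inputs."""
    format_chars = {cp for cp in code_points
                    if unicodedata.category(chr(cp)) == "Cf"}
    result = set(other_dicp) | format_chars | set(variation_selectors)
    return result - set(white_space) - ANNOTATION_CHARACTERS - VISIBLE_CF


# Restrict the Cf scan to a small range to keep the example fast.
subset = derive_default_ignorable(
    other_dicp={0x034F},           # COMBINING GRAPHEME JOINER
    variation_selectors={0xFE00},  # VARIATION SELECTOR-1
    white_space={0x0020},
    code_points=range(0x2000, 0x2070),
)
print(sorted(f"U+{cp:04X}" for cp in subset))
```
</td>
</tr>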
  2188. <tr>
  2189. <td valign="top" align="left"><a name="Grapheme_Base" href="#Grapheme_Base">Grapheme_Base</a></td>
  2190. <td valign="top">B</td>
  2191. <td valign="top">N</td>
  2192. <td valign="top">Property used together with the definition of Standard Korean Syllable
  2193. Block to define "Grapheme base". See D58 in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2194. <p><i>Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp -
  2195. <a href="#Grapheme_Extend">Grapheme_Extend</a></i>
  2196. <p><b>Note:</b> Grapheme_Base is a property of individual characters. That usage contrasts
  2197. with "grapheme base", which is an attribute of Unicode strings; a grapheme base may consist
  2198. of a Korean syllable which is itself represented by a sequence of conjoining jamos.</td>
  2199. </tr>
  2200. <tr>
  2201. <td valign="top" align="left"><a name="Grapheme_Extend" href="#Grapheme_Extend">Grapheme_Extend</a></td>
  2202. <td valign="top">B</td>
  2203. <td valign="top">N</td>
  2204. <td valign="top">Property used
  2205. to define "Grapheme extender". See D59 in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  2206. <p><i>Generated from: Me + Mn + <a href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></i></p>
  2207. <p><b>Note:</b> The set of characters for which Grapheme_Extend=Yes is equivalent to
  2208. the set of characters for which Grapheme_Cluster_Break=Extend.</td>
  2209. </tr>
  2210. <tr>
  2211. <td valign="top" align="left"><a name="Grapheme_Link" href="#Grapheme_Link">Grapheme_Link</a>
  2212. (<a href="#Deprecated_Properties">Deprecated</a> as of 5.0.0)</td>
  2213. <td valign="top">B</td>
  2214. <td valign="top">I</td>
  2215. <td valign="top">Formerly proposed for programmatic determination of grapheme cluster boundaries.
  2216. <p><i>Generated from: Canonical_Combining_Class=Virama</i></td>
  2217. </tr>
  2218. <tr>
  2219. <td valign="top" align="left"><a name="Math" href="#Math">Math</a></td>
  2220. <td valign="top">B</td>
  2221. <td valign="top">I</td>
  2222. <td valign="top">Characters with the Math property. For more information, see
  2223. <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].<p><i>Generated from: Sm + <a href="#Other_Math">Other_Math</a></i></td>
  2224. </tr>
  2225. <tr>
  2226. <td valign="top" align="left"><a name="ID_Start" href="#ID_Start">ID_Start</a></td>
  2227. <td valign="top">B</td>
  2228. <td valign="top">I</td>
  2229. <td valign="top" rowspan="4">Used to determine programming identifiers, as described
  2230. in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [<a href="../tr41/tr41-21.html#UAX31">UAX31</a>].</td>
  2231. </tr>
  2232. <tr>
  2233. <td valign="top" align="left"><a name="ID_Continue" href="#ID_Continue">ID_Continue</a></td>
  2234. <td valign="top">B</td>
  2235. <td valign="top">I</td>
  2236. </tr>
  2237. <tr>
  2238. <td valign="top" align="left"><a name="XID_Start" href="#XID_Start">XID_Start</a></td>
  2239. <td valign="top">B</td>
  2240. <td valign="top">I</td>
  2241. </tr>
  2242. <tr>
  2243. <td valign="top" align="left"><a name="XID_Continue" href="#XID_Continue">XID_Continue</a></td>
  2244. <td valign="top">B</td>
  2245. <td valign="top">I</td>
  2246. </tr>
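<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> Python's own identifier syntax is a profile built on XID_Start and XID_Continue, with the underscore additionally allowed as a start character, so <code>str.isidentifier()</code> gives a quick way to observe these properties in action. It is not a direct query of the raw property values, and it does not exclude Python keywords.</p>

```python
SAMPLES = ["caf\u00E9", "x1", "1x", "a-b", "\u03C0r2", "_tmp"]


def classify(samples):
    """Map each sample string to whether it is a valid Python identifier."""
    return {s: s.isidentifier() for s in samples}


for name, ok in classify(SAMPLES).items():
    print(repr(name), ok)
```

<p>A digit is XID_Continue but not XID_Start, which is why <code>x1</code> is accepted and <code>1x</code> is rejected.</p>
</td>
</tr>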
  2247. <tr>
  2248. <th valign="top" align="LEFT" colspan="4">
  2249. <a name="DerivedNormalizationProps.txt" href="#DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a></th>
  2250. </tr>
  2251. <tr>
  2252. <td valign="top" align="left"><a name="Full_Composition_Exclusion" href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a></td>
  2253. <td valign="top">B</td>
  2254. <td valign="top">N</td>
  2255. <td valign="top">Characters that are excluded from composition: those listed explicitly in
  2256. CompositionExclusions.txt, plus the derivable sets of
  2257. <i>Singleton Decompositions</i> and
  2258. <i>Non-Starter Decompositions</i>, as documented in that data file.</td>
  2259. </tr>
  2260. <tr>
  2261. <td valign="top" align="left"><a name="Expands_On_NFC" href="#Expands_On_NFC">Expands_On_NFC</a><br>
  2262. <a name="Expands_On_NFD" href="#Expands_On_NFD">Expands_On_NFD</a><br>
  2263. <a name="Expands_On_NFKC" href="#Expands_On_NFKC">Expands_On_NFKC</a><br>
  2264. <a name="Expands_On_NFKD" href="#Expands_On_NFKD">Expands_On_NFKD</a><br>
  2265. (<a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
  2266. <td valign="top">B</td>
  2267. <td valign="top">N</td>
  2268. <td valign="top">Characters that expand to more than one character in the specified
  2269. normalization form.</td>
  2270. </tr>
  2271. <tr>
  2272. <td valign="top" align="left"><a name="FC_NFKC_Closure" href="#FC_NFKC_Closure">FC_NFKC_Closure</a><br>
  2273. (<a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
  2274. <td valign="top">S</td>
  2275. <td valign="top">N</td>
  2276. <td valign="top">Characters that require extra mappings for closure under Case Folding plus
  2277. Normalization Form KC.
  2278. <p>The mapping is listed in Field 2.</p>
  2279. </td>
  2280. </tr>
  2281. <tr>
  2282. <td valign="top" align="left"><a name="NFD_Quick_Check" href="#NFD_Quick_Check">NFD_Quick_Check</a><br>
  2283. <a name="NFKD_Quick_Check" href="#NFKD_Quick_Check">NFKD_Quick_Check</a><br>
  2284. <a name="NFC_Quick_Check" href="#NFC_Quick_Check">NFC_Quick_Check</a><br>
  2285. <a name="NFKC_Quick_Check" href="#NFKC_Quick_Check">NFKC_Quick_Check</a></td>
  2286. <td valign="top">E</td>
  2287. <td valign="top">N</td>
  2288. <td valign="top">For property values, see <a href="#Decompositions_and_Normalization">
  2289. Decompositions and Normalization</a>. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC)</td>
  2290. </tr>
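<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The quick-check properties exist so that implementations can often avoid running a full normalization. Python does not expose the property values directly, but <code>unicodedata.is_normalized()</code> (Python 3.8 and later) answers the same question at the string level. The helper <code>normalize_if_needed</code> is hypothetical.</p>

```python
import unicodedata


def normalize_if_needed(text, form="NFC"):
    """Return text in the given form, skipping the work when it already is."""
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)


composed = "\u00E9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(unicodedata.is_normalized("NFC", composed))
print(unicodedata.is_normalized("NFC", decomposed))
print(normalize_if_needed(decomposed) == composed)
```
</td>
</tr>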
  2291. <tr>
  2292. <td valign="top" align="left"><a name="NFKC_Casefold" href="#NFKC_Casefold">NFKC_Casefold</a></td>
  2293. <td valign="top">S</td>
  2294. <td valign="top">I</td>
  2295. <td valign="top">A mapping designed for best behavior when doing caseless
  2296. matching of strings interpreted as identifiers. (Abbreviated name: NFKC_CF)
  2297. <p>For the definition of the related string
  2298. transform toNFKC_Casefold() based on this mapping, see <i>Section 3.13, Default
  2299. Case Algorithms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  2300. <p>The mapping is listed in Field 2.
  2301. </td>
  2302. </tr>
  2303. <tr>
  2304. <td valign="top" align="left"><a name="CWKCF" href="#CWKCF">Changes_When_NFKC_Casefolded</a></td>
  2305. <td valign="top">B</td>
  2306. <td valign="top">I</td>
  2307. <td valign="top">Characters which are not identical to their NFKC_Casefold
  2308. mapping.
<p><i>Generated from: (cp != NFKC_Casefold(cp))</i>
  2310. </td>
  2311. </tr>
  2312. <tr>
  2313. <th valign="top" align="LEFT" colspan="4">
  2314. <a name="PropList.txt" href="#PropList.txt">PropList.txt</a></th>
  2315. </tr>
  2316. <tr>
  2317. <td valign="top" align="left"><a name="ASCII_Hex_Digit" href="#ASCII_Hex_Digit">ASCII_Hex_Digit</a></td>
  2318. <td valign="top">B</td>
  2319. <td valign="top">N</td>
  2320. <td valign="top">ASCII characters commonly used for the representation of hexadecimal numbers.</td>
  2321. </tr>
  2322. <tr>
  2323. <td valign="top" align="left"><a name="Bidi_Control" href="#Bidi_Control">Bidi_Control</a></td>
  2324. <td valign="top" align="center">B</td>
  2325. <td valign="top">N</td>
  2326. <td valign="top">Format control characters which have specific functions in the
  2327. Unicode Bidirectional Algorithm [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>].</td>
  2328. </tr>
  2329. <tr>
  2330. <td valign="top" align="left"><a name="Dash" href="#Dash">Dash</a></td>
  2331. <td valign="top" align="center">B</td>
  2332. <td valign="top">I</td>
  2333. <td valign="top">Punctuation characters explicitly called out as dashes in the Unicode
  2334. Standard, plus their compatibility equivalents. Most of these have the General_Category value Pd,
  2335. but some have the General_Category value Sm because of their use in mathematics.</td>
  2336. </tr>
  2337. <tr>
  2338. <td valign="top" align="left"><a name="Deprecated" href="#Deprecated">Deprecated</a></td>
  2339. <td valign="top">B</td>
  2340. <td valign="top">N</td>
  2341. <td valign="top">For a machine-readable list of deprecated characters. No characters will ever
  2342. be removed from the standard, but the usage of deprecated characters is strongly discouraged.</td>
  2343. </tr>
  2344. <tr>
  2345. <td valign="top" align="left"><a name="Diacritic" href="#Diacritic">Diacritic</a></td>
  2346. <td valign="top" align="center">B</td>
  2347. <td valign="top">I</td>
  2348. <td valign="top">Characters that linguistically modify the meaning of another character to
  2349. which they apply. Some diacritics are not combining characters, and some combining characters
  2350. are not diacritics.</td>
  2351. </tr>
  2352. <tr>
  2353. <td valign="top" align="left"><a name="Extender" href="#Extender">Extender</a></td>
  2354. <td valign="top">B</td>
  2355. <td valign="top">I</td>
  2356. <td valign="top">Characters whose principal function is to extend the value or shape of a
  2357. preceding alphabetic character. Typical of these are length and iteration marks.</td>
  2358. </tr>
  2359. <tr>
  2360. <td valign="top" align="left"><a name="Hex_Digit" href="#Hex_Digit">Hex_Digit</a></td>
  2361. <td valign="top">B</td>
  2362. <td valign="top">I</td>
  2363. <td valign="top">Characters commonly used for the representation of hexadecimal numbers, plus
  2364. their compatibility equivalents.</td>
  2365. </tr>
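<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The relationship between ASCII_Hex_Digit and Hex_Digit can be shown with two small sets: the 22 ASCII characters, plus their fullwidth compatibility equivalents (U+FF10..U+FF19, U+FF21..U+FF26, U+FF41..U+FF46). The ranges are written out by hand here rather than read from PropList.txt, and <code>hex_value</code> is a hypothetical helper.</p>

```python
import string
import unicodedata

ASCII_HEX_DIGIT = set(string.hexdigits)  # 0-9, a-f, A-F

# Fullwidth compatibility equivalents of the ASCII hex digits.
FULLWIDTH_HEX = {
    chr(cp)
    for first, last in ((0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46))
    for cp in range(first, last + 1)
}
HEX_DIGIT = ASCII_HEX_DIGIT | FULLWIDTH_HEX


def hex_value(ch):
    """Numeric value of a Hex_Digit character."""
    if ch not in HEX_DIGIT:
        raise ValueError(f"not a hex digit: {ch!r}")
    # NFKC maps the fullwidth forms back to their ASCII equivalents.
    return int(unicodedata.normalize("NFKC", ch), 16)


print(len(ASCII_HEX_DIGIT), len(HEX_DIGIT))
print(hex_value("\uFF26"))
```
</td>
</tr>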
  2366. <tr>
  2367. <td valign="top" align="left"><a name="Hyphen" href="#Hyphen">Hyphen</a>
  2368. (<a href="#Stabilized_Properties">Stabilized</a> as of 4.0.0;
  2369. <a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
  2370. <td valign="top">B</td>
  2371. <td valign="top">I</td>
  2372. <td valign="top">Dashes which are used to mark connections between pieces of words, plus the
  2373. <i>Katakana middle dot</i>. The <i>Katakana middle dot</i> functions like a hyphen, but is shaped like a dot
  2374. rather than a dash.</td>
  2375. </tr>
  2376. <tr>
  2377. <td valign="top" align="left"><a name="Ideographic" href="#Ideographic">Ideographic</a></td>
  2378. <td valign="top">B</td>
  2379. <td valign="top">I</td>
  2380. <td valign="top">Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese)
  2381. or other siniform (Chinese writing-related) ideographs. This property roughly defines the class of
  2382. "Chinese characters" and does not include characters of other
  2383. logographic scripts such as Cuneiform or Egyptian Hieroglyphs. The
  2384. Ideographic property is used in the definition of
  2385. Ideographic Description Sequences.</td>
  2386. </tr>
  2387. <tr>
  2388. <td valign="top" align="left"><a name="IDS_Binary_Operator" href="#IDS_Binary_Operator">IDS_Binary_Operator</a></td>
  2389. <td valign="top">B</td>
  2390. <td valign="top">N</td>
  2391. <td valign="top">Used in Ideographic Description Sequences.</td>
  2392. </tr>
  2393. <tr>
  2394. <td valign="top" align="left"><a name="IDS_Trinary_Operator" href="#IDS_Trinary_Operator">IDS_Trinary_Operator</a></td>
  2395. <td valign="top">B</td>
  2396. <td valign="top">N</td>
  2397. <td valign="top">Used in Ideographic Description Sequences.</td>
  2398. </tr>
  2399. <tr>
  2400. <td valign="top" align="left"><a name="Join_Control" href="#Join_Control">Join_Control</a></td>
  2401. <td valign="top">B</td>
  2402. <td valign="top">N</td>
  2403. <td valign="top">Format control characters which have specific functions for control of
  2404. cursive joining and ligation.</td>
  2405. </tr>
  2406. <tr>
  2407. <td valign="top" align="left"><a name="Logical_Order_Exception" href="#Logical_Order_Exception">Logical_Order_Exception</a></td>
  2408. <td valign="top">B</td>
  2409. <td valign="top">N</td>
  2410. <td valign="top">A small number of spacing vowel letters occurring in certain
  2411. Southeast Asian scripts such as Thai and Lao, which use a visual order display
  2412. model. These letters are stored in text ahead of syllable-initial consonants,
  2413. and require special handling for processes such as searching and sorting.</td>
  2414. </tr>
  2415. <tr>
  2416. <td valign="top" align="left"><a name="Noncharacter_Code_Point" href="#Noncharacter_Code_Point">Noncharacter_Code_Point</a></td>
  2417. <td valign="top">B</td>
  2418. <td valign="top">N</td>
  2419. <td valign="top">Code points permanently reserved for internal use.</td>
  2420. </tr>
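<tr>
<td valign="top" colspan="4">
<p><i>Illustrative sketch, not part of the UCD.</i> The noncharacters are U+FDD0..U+FDEF together with the last two code points of each of the 17 planes, so membership can be tested arithmetically without a data file. The function name is hypothetical.</p>

```python
def is_noncharacter(cp):
    """True for the 66 noncharacter code points."""
    if cp not in range(0x110000):
        raise ValueError("not a Unicode code point")
    return cp in range(0xFDD0, 0xFDF0) or cp % 0x10000 in (0xFFFE, 0xFFFF)


NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(NONCHARACTERS))
```
</td>
</tr>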
  2421. <tr>
  2422. <td valign="top" align="left"><a name="Other_Alphabetic" href="#Other_Alphabetic">Other_Alphabetic</a></td>
  2423. <td valign="top" align="center">B</td>
  2424. <td valign="top">C</td>
  2425. <td valign="top">Used in deriving the Alphabetic property.</td>
  2426. </tr>
  2427. <tr>
  2428. <td valign="top" align="left"><a name="Other_Default_Ignorable_Code_Point" href="#Other_Default_Ignorable_Code_Point">
  2429. Other_Default_Ignorable_Code_Point</a></td>
  2430. <td valign="top">B</td>
  2431. <td valign="top">C</td>
  2432. <td valign="top">Used in deriving the Default_Ignorable_Code_Point property.</td>
  2433. </tr>
  2434. <tr>
  2435. <td valign="top" align="left"><a name="Other_Grapheme_Extend" href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></td>
  2436. <td valign="top" align="center">B</td>
  2437. <td valign="top">C</td>
  2438. <td valign="top">Used in deriving the Grapheme_Extend property.</td>
  2439. </tr>
  2440. <tr>
  2441. <td valign="top" align="left"><a name="Other_ID_Continue" href="#Other_ID_Continue">Other_ID_Continue</a></td>
  2442. <td valign="top">B</td>
  2443. <td valign="top">C</td>
  2444. <td valign="top">Used to maintain backward compatibility of <a href="#ID_Continue">ID_Continue</a>.</td>
  2445. </tr>
  2446. <tr>
  2447. <td valign="top" align="left"><a name="Other_ID_Start" href="#Other_ID_Start">Other_ID_Start</a></td>
  2448. <td valign="top">B</td>
  2449. <td valign="top">C</td>
  2450. <td valign="top">Used to maintain backward compatibility of <a href="#ID_Start">ID_Start</a>.</td>
  2451. </tr>
  2452. <tr>
  2453. <td valign="top" align="left"><a name="Other_Lowercase" href="#Other_Lowercase">Other_Lowercase</a></td>
  2454. <td valign="top">B</td>
  2455. <td valign="top">C</td>
  2456. <td valign="top">Used in deriving the Lowercase property.</td>
  2457. </tr>
  2458. <tr>
  2459. <td valign="top" align="left"><a name="Other_Math" href="#Other_Math">Other_Math</a></td>
  2460. <td valign="top">B</td>
  2461. <td valign="top">C</td>
  2462. <td valign="top">Used in deriving the Math property.</td>
  2463. </tr>
  2464. <tr>
  2465. <td valign="top" align="left"><a name="Other_Uppercase" href="#Other_Uppercase">Other_Uppercase</a></td>
  2466. <td valign="top">B</td>
  2467. <td valign="top">C</td>
  2468. <td valign="top">Used in deriving the Uppercase property.</td>
  2469. </tr>
  2470. <tr>
  2471. <td><a name="Pattern_Syntax" href="#Pattern_Syntax">Pattern_Syntax</a></td>
  2472. <td valign="top">B</td>
  2473. <td valign="top">N</td>
  2474. <td valign="top" rowspan="2">Used for pattern syntax as described in Unicode Standard Annex #31, "Unicode Identifier
  2475. and Pattern Syntax" [<a href="../tr41/tr41-21.html#UAX31">UAX31</a>].</td>
  2476. </tr>
  2477. <tr>
  2478. <td><a name="Pattern_White_Space" href="#Pattern_White_Space">Pattern_White_Space</a></td>
  2479. <td valign="top">B</td>
  2480. <td valign="top">N</td>
  2481. </tr>
  2482. <tr>
  2483. <td><a name="Prepended_Concatenation_Mark" href="#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a></td>
  2484. <td valign="top">B</td>
  2485. <td valign="top">I</td>
  2486. <td valign="top">A small class of visible format controls, which precede and then span
  2487. a sequence of other characters, usually digits. These have also been known as
  2488. "subtending marks", because most of them take a form which visually extends underneath
  2489. the sequence of following digits.</td>
  2490. </tr>
  2491. <tr>
  2492. <td valign="top" align="left"><a name="Quotation_Mark" href="#Quotation_Mark">Quotation_Mark</a></td>
  2493. <td valign="top">B</td>
  2494. <td valign="top">I</td>
  2495. <td valign="top">Punctuation characters that function as quotation marks.</td>
  2496. </tr>
  2497. <tr>
  2498. <td valign="top" align="left"><a name="Radical" href="#Radical">Radical</a></td>
  2499. <td valign="top">B</td>
  2500. <td valign="top">N</td>
  2501. <td valign="top">Used in the definition of Ideographic Description Sequences.</td>
  2502. </tr>
  2503. <tr>
  2504. <td valign="top" align="left"><a name="Regional_Indicator" href="#Regional_Indicator">Regional_Indicator</a></td>
  2505. <td valign="top">B</td>
  2506. <td valign="top">N</td>
  2507. <td valign="top">Property of the regional indicator characters, U+1F1E6..U+1F1FF. This
  2508. property is referenced in various segmentation algorithms, to assist in correct
  2509. breaking around emoji flag sequences.</td>
  2510. </tr>
  2511. <tr>
  2512. <td valign="top" align="left"><a name="STerm" href="#STerm">Sentence_Terminal</a></td>
  2513. <td valign="top">B</td>
  2514. <td valign="top">I</td>
  2515. <td valign="top">Punctuation characters that generally mark the end of sentences.
  2516. Used in Unicode Standard Annex #29, "Unicode Text Segmentation"
  2517. [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</td>
  2518. </tr>
  2519. <tr>
  2520. <td valign="top" align="left"><a name="Soft_Dotted" href="#Soft_Dotted">Soft_Dotted</a></td>
  2521. <td valign="top" align="center">B</td>
  2522. <td valign="top">N</td>
  2523. <td valign="top">Characters with a &quot;soft dot&quot;, like <i>i</i> or <i>j</i>. An accent placed on
  2524. these characters causes the dot to disappear. An explicit <i>dot above</i> can be added where
  2525. required, such as in Lithuanian. See <i>Section 7.1, Latin</i>
  2526. in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</td>
  2527. </tr>
  2528. <tr>
  2529. <td valign="top" align="left"><a name="Terminal_Punctuation" href="#Terminal_Punctuation">Terminal_Punctuation</a></td>
  2530. <td valign="top" align="center">B</td>
  2531. <td valign="top">I</td>
  2532. <td valign="top">Punctuation characters that generally mark the end of textual units.</td>
  2533. </tr>
  2534. <tr>
  2535. <td valign="top" align="left"><a name="Unified_Ideograph" href="#Unified_Ideograph">Unified_Ideograph</a></td>
  2536. <td valign="top">B</td>
  2537. <td valign="top">N</td>
  2538. <td valign="top">A property which specifies
  2539. the exact set of Unified CJK Ideographs in the standard. This set
  2540. excludes CJK Compatibility Ideographs (which have canonical decompositions
  2541. to Unified CJK Ideographs), as well as characters from the CJK
  2542. Symbols and Punctuation block. The class of
  2543. Unified_Ideograph=Y characters is a proper subset of the class of
  2544. Ideographic=Y characters.</td>
  2545. </tr>
  2546. <tr>
  2547. <td valign="top" align="left"><a name="Variation_Selector" href="#Variation_Selector">Variation_Selector</a></td>
  2548. <td valign="top">B</td>
  2549. <td valign="top">N</td>
  2550. <td valign="top">Indicates characters that are Variation Selectors. For
  2551. details on the behavior of these characters, see
  2552. <i>Section 23.4, Variation Selectors</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>],
  2553. and Unicode Technical Standard #37, "Unicode Ideographic Variation Database" [<a href="../tr41/tr41-21.html#UTS37">UTS37</a>].</td>
  2554. </tr>
  2555. <tr>
  2556. <td valign="top" align="left"><a name="White_Space" href="#White_Space">White_Space</a></td>
  2557. <td valign="top">B</td>
  2558. <td valign="top">N</td>
  2559. <td valign="top">Spaces, separator characters and
  2560. other control characters which should be treated by
  2561. programming languages as &quot;white space&quot; for the purpose of parsing elements.
  2562. See also <a href="#Line_Break">Line_Break</a>,
  2563. <a href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a>,
  2564. <a href="#Sentence_Break">Sentence_Break</a>,
  2565. and <a href="#Word_Break">Word_Break</a>, which classify space characters and related controls somewhat differently
  2566. for particular text segmentation contexts.
  2567. </td>
  2568. </tr>
  2569. <tr>
  2570. <th valign="top" align="LEFT" colspan="4">
  2571. <a name="UnicodeData.txt" href="#UnicodeData.txt">UnicodeData.txt</a></th>
  2572. </tr>
  2573. <tr>
  2574. <td valign="top"><a name="Name" href="#Name">Name</a></td>
  2575. <td valign="top" align="center">M</td>
  2576. <td valign="top" align="center">N</td>
  2577. <td valign="top">(1)
  2578. When a string value not enclosed in &lt;angle brackets&gt;
  2579. occurs in this field, it specifies the character's Name property value, which
  2580. matches exactly the name published in
  2581. the code charts.
  2582. The Name property value for most ideographic characters and
  2583. for Hangul syllables is derived instead by various rules. See <i>Section 4.8, Name</i> in
  2584. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] for a full specification of those
  2585. rules. Strings enclosed in &lt;angle brackets&gt; in this field either provide label
  2586. information used in the name derivation rules, or&#x2014;in the case of characters
  2587. which have a null string as their Name property value, such as control characters&#x2014;provide
  2588. other information about their code point type.
  2589. </td>
  2590. </tr>
  2591. <tr>
  2592. <td valign="top"><a name="General_Category" href="#General_Category">General_Category</a></td>
  2593. <td valign="top" align="center">E</td>
  2594. <td valign="top" align="center">N</td>
  2595. <td valign="top">(2) This is a useful breakdown into various character types which can be used
  2596. as a default categorization in implementations. For the property values, see
  2597. <a href="#General_Category_Values">General Category Values</a>.</td>
  2598. </tr>
  2599. <tr>
  2600. <td valign="top"><a name="Canonical_Combining_Class" href="#Canonical_Combining_Class">Canonical_Combining_Class</a></td>
  2601. <td valign="top" align="center">N</td>
  2602. <td valign="top" align="center">N</td>
  2603. <td valign="top">(3) The classes used for the Canonical Ordering Algorithm in the Unicode
  2604. Standard. This property could be considered either an
  2605. enumerated property or a numeric property: the principal use of the property is in
  2606. terms of the numeric values. For the property value names associated with different numeric values, see
  2607. <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a> and <a href="#Canonical_Combining_Class_Values">Canonical Combining
  2608. Class Values</a>.</td>
  2609. </tr>
  2610. <tr>
  2611. <td valign="top"><a name="Bidi_Class" href="#Bidi_Class">Bidi_Class</a></td>
  2612. <td valign="top" align="center">E</td>
  2613. <td valign="top" align="center">N</td>
  2614. <td valign="top">(4) These are the categories required by the Unicode Bidirectional Algorithm.
  2615. For the property values, see <a href="#Bidi_Class_Values">Bidirectional Class
  2616. Values</a>. For more information, see Unicode Standard Annex #9, "Unicode Bidirectional Algorithm"
  2617. [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>].<p>
  2618. The default property values depend on the code point, and are explained in
  2619. DerivedBidiClass.txt.</td>
  2620. </tr>
  2621. <tr>
  2622. <td valign="top"><a name="Decomposition_Type" href="#Decomposition_Type">Decomposition_Type</a><br>
  2623. <a name="Decomposition_Mapping" href="#Decomposition_Mapping">Decomposition_Mapping</a></td>
  2624. <td valign="top" align="center">E, S</td>
  2625. <td valign="top" align="center">N</td>
  2626. <td valign="top">(5) This field contains both values, with the type in angle brackets. The
  2627. decomposition mappings exactly match the decomposition mappings published with the character
  2628. names in the Unicode Standard. For more information, see
  2629. <a href="#Character_Decomposition_Mappings">Character Decomposition Mappings</a>.
  2630. </td>
  2631. </tr>
  2632. <tr>
  2633. <td valign="top" rowspan="3"><a name="Numeric_Type" href="#Numeric_Type">Numeric_Type</a><br>
  2634. <a name="Numeric_Value" href="#Numeric_Value">Numeric_Value</a></td>
  2635. <td valign="top" align="center">E, N</td>
  2636. <td valign="top" align="center">N</td>
  2637. <td valign="top">(6) If the character has the
  2638. property value Numeric_Type=Decimal, then the
  2639. Numeric_Value of that digit is represented with an integer
  2640. value (limited to the range 0..9) in fields 6, 7, and 8.
  2641. Characters with the property value Numeric_Type=Decimal are
  2642. restricted to digits which can be used in a decimal radix positional numeral system and
  2643. which are encoded in the standard in a contiguous ascending range 0..9. See the discussion of
  2644. <i>decimal digits</i> in <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</td>
  2645. </tr>
  2646. <tr>
  2647. <td valign="top" align="center">E, N</td>
  2648. <td valign="top" align="center">N</td>
  2649. <td valign="top">(7) If the character has the
  2650. property value Numeric_Type=Digit, then the
  2651. Numeric_Value of that digit is represented with an
  2652. integer value (limited to the range 0..9) in fields 7 and 8, and field 6 is null.
  2653. This covers digits that need special handling, such as the compatibility superscript digits.
  2654. <p>Starting with Unicode 6.3.0, no newly encoded numeric characters will be
  2655. given Numeric_Type=Digit, nor will existing characters with Numeric_Type=Numeric be changed
  2656. to Numeric_Type=Digit. The distinction between those two types is not considered useful.</p></td>
  2657. </tr>
  2658. <tr>
  2659. <td valign="top" align="center">E, N</td>
  2660. <td valign="top" align="center">N</td>
  2661. <td valign="top">(8) If the character has the
  2662. property value Numeric_Type=Numeric, then the
  2663. Numeric_Value of that character is represented with a positive or
  2664. negative integer or rational number in this field, and
  2665. fields 6 and 7 are null. This includes fractions such as, for example, &quot;1/5&quot; for
  2666. U+2155 VULGAR FRACTION ONE FIFTH.
  2667. <p>Some characters have these properties based on values from the Unihan data files. See
  2668. <a href="#Numeric_Type_Han">Numeric_Type, Han</a>.</p></td>
  2669. </tr>
  2670. <tr>
  2671. <td valign="top"><a name="Bidi_Mirrored" href="#Bidi_Mirrored">Bidi_Mirrored</a></td>
  2672. <td valign="top" align="center">B</td>
  2673. <td valign="top" align="center">N</td>
  2674. <td valign="top">(9) If the character is a &quot;mirrored&quot; character in
  2675. bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.
  2676. See <i>Section 4.7, Bidi Mirrored</i> of [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]. <i>Do not confuse this with
  2677. the <a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a> property.</i></td>
  2678. </tr>
  2679. <tr>
  2680. <td valign="top"><a name="Unicode_1_Name" href="#Unicode_1_Name">Unicode_1_Name</a>
  2681. (<a href="#Obsolete_Properties">Obsolete</a> as of 6.2.0)</td>
  2682. <td valign="top" align="center">M</td>
  2683. <td valign="top" align="center">I</td>
  2684. <td valign="top">(10) Old name as published in Unicode 1.0 or
  2685. ISO 6429 names for control functions. This field is empty unless it is significantly
  2686. different from the current name for the character.
  2687. No longer used in code chart production. See <a href="#Name_Alias">Name_Alias</a>.
  2688. </td>
  2689. </tr>
  2690. <tr>
  2691. <td valign="top"><a name="ISO_Comment" href="#ISO_Comment">ISO_Comment</a>
  2692. (<a href="#Obsolete_Properties">Obsolete</a> as of 5.2.0;
  2693. <a href="#Deprecated_Properties">Deprecated</a> and <a href="#Stabilized_Properties">Stabilized</a>
  2694. as of 6.0.0)</td>
  2695. <td valign="top" align="center">M</td>
  2696. <td valign="top" align="center">I</td>
  2697. <td valign="top">(11) ISO 10646 comment field. It
  2698. was used for notes that appeared in parentheses in the
  2699. 10646 names list, or contained an asterisk to mark an Annex P note.
  2700. <p>As of Unicode 5.2.0, this field no longer contains any non-null values.</p>
  2701. </td>
  2702. </tr>
  2703. <tr>
  2704. <td valign="top"><a name="Simple_Uppercase_Mapping" href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a></td>
  2705. <td valign="top" align="center">S</td>
  2706. <td valign="top" align="center">N</td>
  2707. <td valign="top">(12) Simple uppercase mapping (single character result).
  2708. If a character is
  2709. part of an alphabet with case distinctions, and has a simple uppercase equivalent, then the
  2710. uppercase equivalent is in this field. The
  2711. simple mappings have a single character result, whereas the full mappings may have
  2712. multi-character results. For more information, see <a href="#Casemapping">Case and Case Mapping</a>.
  2713. </td>
  2714. </tr>
  2715. <tr>
  2716. <td valign="top"><a name="Simple_Lowercase_Mapping" href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a></td>
  2717. <td valign="top" align="center">S</td>
  2718. <td valign="top" align="center">N</td>
  2719. <td valign="top">(13) Simple lowercase mapping (single character result).
  2720. </td>
  2721. </tr>
  2722. <tr>
  2723. <td><a name="Simple_Titlecase_Mapping" href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
  2724. <td valign="top" align="center">S</td>
  2725. <td valign="top" align="center">N</td>
  2726. <td valign="top">(14) Simple titlecase mapping (single character result).
  2727. <p><b>Note:</b> If this
  2728. field is null, then the Simple_Titlecase_Mapping is the same as the
  2729. Simple_Uppercase_Mapping for this character.</p></td>
  2730. </tr>
  2731. <tr>
  2732. <th colspan="4">
  2733. <a name="VerticalOrientation.txt" href="#VerticalOrientation.txt">VerticalOrientation.txt</a></th>
  2734. </tr>
  2735. <tr>
  2736. <td><a name="Vertical_Orientation" href="#Vertical_Orientation">Vertical_Orientation</a></td>
  2737. <td>E</td>
  2738. <td>I</td>
  2739. <td>A property used to establish a default for the correct orientation of characters
  2740. when used in vertical text layout, as described in Unicode Standard Annex #50,
  2741. "Unicode Vertical Text Layout"
  2742. [<a href="../tr41/tr41-21.html#UAX50">UAX50</a>].</td>
  2743. </tr>
  2744. </table>
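<p>The UnicodeData.txt field numbers given in parentheses in the table above can be
read as zero-based indexes into a semicolon-delimited record. The following Python
sketch is illustrative only: the function names are hypothetical, and the two sample
records are reproduced from memory and should be checked against the actual data file.
It shows the fallback from a null field 14 to field 12, and the reading of field 8 as
an integer or rational number.</p>

```python
from fractions import Fraction

# Zero-based field indexes, matching the parenthesized numbers in the table.
NUMERIC_FIELD = 8
SIMPLE_UPPER_FIELD = 12
SIMPLE_TITLE_FIELD = 14

# Sample records (assumed to match UnicodeData.txt; verify before relying on them).
A_SMALL = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
ONE_FIFTH = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
             "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")


def parse_record(line):
    """Split one UnicodeData.txt line into its 15 fields (0 through 14)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return fields


def simple_titlecase(fields):
    """Return Simple_Titlecase_Mapping as a code point, or None if unmapped.

    A null field 14 means the value equals Simple_Uppercase_Mapping (field 12).
    """
    value = fields[SIMPLE_TITLE_FIELD] or fields[SIMPLE_UPPER_FIELD]
    return int(value, 16) if value else None


def numeric_value(fields):
    """Return field 8 as an exact rational number, or None if the field is null."""
    value = fields[NUMERIC_FIELD]
    return Fraction(value) if value else None
```

<p>Using <code>Fraction</code> keeps values such as "1/5" exact; converting to a
floating-point number is left to the caller.</p>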
  2745. <p>&nbsp;</p>
  2746. <h3>5.4 <a name="Derived_Extracted" href="#Derived_Extracted">Derived Extracted Properties</a></h3>
  2747. <p>A number of Unicode character properties have been separated out, reformatted,
  2748. and listed in range format, one property per file. These files
  2749. are located under the <i>extracted</i> directory of the UCD.
  2750. The exact list of derived extracted files and the extracted properties they
  2751. represent are given in <a href="#Extracted_Properties_Table"><i>Table 10</i></a>.</p>
  2752. <p>The derived extracted files are provided
  2753. primarily as a reformatting of data for properties specified in other data files.
  2754. For <i>nondefault</i> values of properties, if there is
  2755. any inadvertent mismatch between the primary data files specifying
  2756. those properties and these lists of extracted properties, the primary
  2757. data files are taken as definitive. However, for <i>default</i> values
  2758. of properties, the extracted data files are definitive. This is particularly true for properties
  2759. which have multiple default values; those properties are identified with an asterisk
  2760. in the table. See Section 4.2.9, <a href="#Default_Values">Default Values</a>.</p>
  2761. <p class="caption">Table 10. <a name="Extracted_Properties_Table" href="#Extracted_Properties_Table">Extracted Properties</a></p>
  2762. <div align="center">
  2763. <table class="simple">
  2764. <tr>
  2765. <th>File</th>
  2766. <th>Status</th>
  2767. <th>Property</th>
  2768. <th>Extracted from</th>
  2769. </tr>
  2770. <tr>
  2771. <td>DerivedBidiClass.txt</td>
  2772. <td style="text-align:center">N</td>
  2773. <td>Bidi_Class*</td>
  2774. <td>UnicodeData.txt, field 4</td>
  2775. </tr>
  2776. <tr>
  2777. <td>DerivedBinaryProperties.txt</td>
  2778. <td style="text-align:center">N</td>
  2779. <td>Bidi_Mirrored</td>
  2780. <td>UnicodeData.txt, field 9</td>
  2781. </tr>
  2782. <tr>
  2783. <td><a name="DerivedCombiningClass.txt"></a>DerivedCombiningClass.txt</td>
  2784. <td style="text-align:center">N</td>
  2785. <td>Canonical_Combining_Class</td>
  2786. <td>UnicodeData.txt, field 3</td>
  2787. </tr>
  2788. <tr>
  2789. <td>DerivedDecompositionType.txt</td>
  2790. <td style="text-align:center">N/I</td>
  2791. <td>Decomposition_Type</td>
  2792. <td>the &lt;tag&gt; in UnicodeData.txt, field 5</td>
  2793. </tr>
  2794. <tr>
  2795. <td>DerivedEastAsianWidth.txt</td>
  2796. <td style="text-align:center">I</td>
  2797. <td>East_Asian_Width*</td>
  2798. <td>EastAsianWidth.txt, field 1</td>
  2799. </tr>
  2800. <tr>
  2801. <td>DerivedGeneralCategory.txt</td>
  2802. <td style="text-align:center">N</td>
  2803. <td>General_Category</td>
  2804. <td>UnicodeData.txt, field 2</td>
  2805. </tr>
  2806. <tr>
  2807. <td>DerivedJoiningGroup.txt</td>
  2808. <td style="text-align:center">N</td>
  2809. <td>Joining_Group</td>
  2810. <td>ArabicShaping.txt, field 3</td>
  2811. </tr>
  2812. <tr>
  2813. <td>DerivedJoiningType.txt</td>
  2814. <td style="text-align:center">N</td>
  2815. <td>Joining_Type*</td>
  2816. <td>ArabicShaping.txt, field 2</td>
  2817. </tr>
  2818. <tr>
  2819. <td>DerivedLineBreak.txt</td>
  2820. <td style="text-align:center">N</td>
  2821. <td>Line_Break*</td>
  2822. <td>LineBreak.txt, field 1</td>
  2823. </tr>
  2824. <tr>
  2825. <td>DerivedName.txt</td>
  2826. <td style="text-align:center">N</td>
  2827. <td>Name</td>
  2828. <td>UnicodeData.txt, field 1</td>
  2829. </tr>
  2830. <tr>
  2831. <td>DerivedNumericType.txt</td>
  2832. <td style="text-align:center">N</td>
  2833. <td>Numeric_Type</td>
  2834. <td>UnicodeData.txt, fields 6 through 8</td>
  2835. </tr>
  2836. <tr>
  2837. <td>DerivedNumericValues.txt</td>
  2838. <td style="text-align:center">N</td>
  2839. <td>Numeric_Value</td>
  2840. <td>UnicodeData.txt, field 8</td>
  2841. </tr>
  2842. </table>
  2843. </div>
  2844. <p>For the extraction of Decomposition_Type, characters with canonical
  2845. decomposition mappings in field 5 of UnicodeData.txt have no tag. For
  2846. those characters, the extracted value is Decomposition_Type=Canonical. For characters
  2847. with compatibility decomposition mappings, there are explicit tags
  2848. in field 5, and the value of Decomposition_Type
  2849. is equivalent to those tags. The value Decomposition_Type=Canonical is
  2850. normative. Other values for Decomposition_Type are informative.</p>
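<p>The extraction rule for Decomposition_Type can be sketched as follows. This is a
minimal illustration, not the tooling used to generate the extracted files: the
function name is hypothetical, and the tag text is returned exactly as it appears in
field 5 rather than being converted to a formal property value alias.</p>

```python
import re

# A compatibility tag such as "<fraction>" at the start of field 5.
_TAG = re.compile(r"^<(\w+)>\s*")


def decomposition(field5):
    """Return (decomposition_type, mapping) for field 5 of UnicodeData.txt.

    An empty field has the default type "None"; an untagged mapping is
    "Canonical"; otherwise the type is the tag text.
    """
    if not field5:
        return ("None", [])
    match = _TAG.match(field5)
    if match:
        dtype = match.group(1)
        rest = field5[match.end():]
    else:
        dtype = "Canonical"
        rest = field5
    return (dtype, [int(cp, 16) for cp in rest.split()])
```

<p>For example, the untagged mapping "0041 030A" yields the type "Canonical", while
"&lt;fraction&gt; 0031 2044 0035" yields the type "fraction".</p>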
  2851. <p>The value of the Name property is extracted based on the actual string value
  2852. of the data in field 1 of UnicodeData.txt, omitting any code points
  2853. with the default null string value. Then for code points in the
  2854. Hangul Syllables block, the Hangul
  2855. Syllable Name Generation algorithm defined in <i>Section 3.12, Conjoining
  2856. Jamo Behavior</i> of [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  2857. is applied, to create the explicit formal
  2858. names of all Hangul syllables. Characters whose names are algorithmically
  2859. defined based on suffixing the code point to a specific identifying
  2860. string prefix, such as CJK UNIFIED IDEOGRAPH-4E00, are listed with
  2861. a compact range convention in DerivedName.txt, using an
  2862. asterisk "*" character as the placeholder for the code point.
  2863. See <i>Section 4.8, Name</i> of [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  2864. for more information about how the Name property is derived.</p>
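<p>The asterisk placeholder convention can be expanded with a few lines of code. The
sketch below assumes that the placeholder is the final character of the listed name
and that the code point is written in uppercase hexadecimal with at least four digits;
the function name is hypothetical.</p>

```python
def expand_name(pattern, code_point):
    """Substitute the code point for a trailing "*" placeholder in a name."""
    if pattern.endswith("*"):
        return pattern[:-1] + f"{code_point:04X}"
    # Names without a placeholder are already complete.
    return pattern
```

<p>With this convention, the pattern "CJK UNIFIED IDEOGRAPH-*" applied to U+4E00 gives
CJK UNIFIED IDEOGRAPH-4E00, and applied to U+20000 gives CJK UNIFIED IDEOGRAPH-20000.</p>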
  2865. <p>Numeric_Value is extracted based on the actual numeric value of the
  2866. data in field 8 of UnicodeData.txt or the values
  2867. of the kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags, for
  2868. characters listed in the Unihan data files.</p>
  2869. <p>Numeric_Type is extracted as follows. If fields 6, 7, and 8 in UnicodeData.txt
  2870. are all non-empty, then Numeric_Type=Decimal. Otherwise, if fields 7 and 8 are both
  2871. non-empty, then Numeric_Type=Digit. Otherwise, if field 8 is non-empty, then
  2872. Numeric_Type=Numeric.
  2873. For characters listed in the Unihan data files,
  2874. Numeric_Type=Numeric for characters that have kPrimaryNumeric, kAccountingNumeric,
  2875. or kOtherNumeric tags. The default value is Numeric_Type=None.</p>
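<p>The cascade of tests on fields 6, 7, and 8 translates directly into code. The
following sketch covers only the UnicodeData.txt part of the rule; the Unihan tags
mentioned above are not handled, and the function name is hypothetical.</p>

```python
def numeric_type(field6, field7, field8):
    """Derive Numeric_Type from fields 6, 7, and 8 of a UnicodeData.txt record."""
    if field6 and field7 and field8:
        return "Decimal"
    if field7 and field8:
        return "Digit"
    if field8:
        return "Numeric"
    return "None"  # the default value
```

<p>For instance, a record with "0" in all three fields is Decimal, a record with field
6 null and "2" in fields 7 and 8 is Digit, and a record with only "1/5" in field 8 is
Numeric.</p>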
  2876. <h3>5.5 <a name="Contributory_Properties" href="#Contributory_Properties">Contributory Properties</a></h3>
  2877. <p>Contributory properties contain sets of exceptions used in the generation of
  2878. other properties derived from them. The contributory properties specifically concerned with
  2879. identifiers and casing contribute to the maintenance of
  2880. stability guarantees for properties and/or to invariance relationships
  2881. between related properties. Other contributory properties are simply
  2882. defined as a convenience for property derivation.</p>
  2883. <p>Most contributory properties have names using
  2884. the pattern "Other_XXX" and are used to derive the corresponding "XXX" property.
  2885. For example, the Other_Alphabetic property is used in the derivation of the <a href="#Alphabetic">Alphabetic</a>
  2886. property.</p>
  2887. <p>Contributory properties are typically defined in
  2888. <a href="#PropList.txt">PropList.txt</a> and the corresponding derived property
  2889. is then listed in
  2890. <a href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a>.</p>
  2891. <p><a href="#Jamo_Short_Name">Jamo_Short_Name</a> is an unusual contributory
  2892. property, both in terms of its name and how it is used. It is defined in
  2893. its own property file, Jamo.txt, and is used to derive the Name
  2894. property value for Hangul syllable characters, according to the rules
  2895. spelled out in <i>Section 3.12, Conjoining Jamo Behavior</i> in
  2896. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  2897. <p><i>Contributory</i> is considered to be a distinct status for a Unicode
  2898. character property. Contributory properties are neither <i>normative</i> nor
  2899. <i>informative</i>. This distinct status is marked with
  2900. the symbol "C" in the status column in the property table.
  2901. For convenience of reference, all contributory properties are also listed
  2902. in <a href="#Contributory_Properties_Table"><i>Table 10a</i></a>, along with the
  2903. properties whose derivation they contribute to.</p>
  2904. <p class="caption">Table 10a. <a name="Contributory_Properties_Table" href="#Contributory_Properties_Table">Contributory Properties</a></p>
  2905. <div align="center">
  2906. <table class="simple">
  2907. <tr>
  2908. <th>File</th>
  2909. <th>Property</th>
  2910. <th>Used in Derivation of</th>
  2911. </tr>
  2912. <tr>
  2913. <td>Jamo.txt</td>
  2914. <td>Jamo_Short_Name</td>
  2915. <td>Name</td>
  2916. </tr>
  2917. <tr>
  2918. <td rowspan="8" style="vertical-align:middle">PropList.txt</td>
  2919. <td>Other_Alphabetic</td>
  2920. <td>Alphabetic</td>
  2921. </tr>
  2922. <tr>
  2923. <td>Other_Default_Ignorable_Code_Point</td>
  2924. <td>Default_Ignorable_Code_Point</td>
  2925. </tr>
  2926. <tr>
  2927. <td>Other_Grapheme_Extend</td>
  2928. <td>Grapheme_Extend</td>
  2929. </tr>
  2930. <tr>
  2931. <td>Other_ID_Start</td>
  2932. <td>ID_Start, XID_Start</td>
  2933. </tr>
  2934. <tr>
  2935. <td>Other_ID_Continue</td>
  2936. <td>ID_Continue, XID_Continue</td>
  2937. </tr>
  2938. <tr>
  2939. <td>Other_Lowercase</td>
  2940. <td>Lowercase</td>
  2941. </tr>
  2942. <tr>
  2943. <td>Other_Math</td>
  2944. <td>Math</td>
  2945. </tr>
  2946. <tr>
  2947. <td>Other_Uppercase</td>
  2948. <td>Uppercase</td>
  2949. </tr>
  2950. </table>
  2951. </div>
  2952. <p>Contributory properties are
  2953. incomplete by themselves and are not intended for independent use. For example,
  2954. an API returning Unicode property values should implement the derived
  2955. core properties such as Alphabetic or Default_Ignorable_Code_Point,
  2956. rather than the corresponding contributory properties,
  2957. Other_Alphabetic or Other_Default_Ignorable_Code_Point.</p>
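<p>As an illustration of why the derived property is the one to expose, consider
Math. The sketch below assumes the derivation "Sm + Other_Math", which is taken from
the header comments of DerivedCoreProperties.txt rather than from this section. The
Other_Math set shown is a one-element excerpt for demonstration, not the full list in
PropList.txt, and the function name is hypothetical.</p>

```python
import unicodedata

# Illustrative excerpt only; the complete Other_Math set is in PropList.txt.
OTHER_MATH_SAMPLE = frozenset({0x005E})  # CIRCUMFLEX ACCENT


def is_math(code_point, other_math=OTHER_MATH_SAMPLE):
    """Approximate Math as General_Category=Sm plus the Other_Math exceptions."""
    in_sm = unicodedata.category(chr(code_point)) == "Sm"
    return in_sm or code_point in other_math
```

<p>Here U+002B PLUS SIGN qualifies through its General_Category, whereas U+005E
qualifies only through the contributory set. Testing Other_Math alone would miss
U+002B.</p>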
  2958. <h3>5.6 <a name="Casemapping" href="#Casemapping">Case and Case Mapping</a></h3>
  2959. <p>Case for bicameral scripts and case mapping of characters are
  2960. complicated topics in the Unicode Standard&#x2014;both because of
  2961. their inherent algorithmic complexity and because of the number of characters
  2962. and special edge cases involved.</p>
  2963. <p>This section provides a brief roadmap to the discussions of these
  2964. topics and to the relevant specifications and definitions in the standard.
  2965. It also explains which case-related properties are defined in the UCD.</p>
  2966. <p><i>Section 3.13, Default Case Algorithms</i> in
  2967. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  2968. provides formal definitions for a number of case-related concepts (<i>cased</i>,
  2969. <i>case-ignorable</i>,&nbsp;...), for
  2970. case conversion (<i>toUppercase(X)</i>,&nbsp;...), and for case detection
  2971. (<i>isUppercase(X)</i>,&nbsp;...). It also provides the formal definition
  2972. of caseless matching for the standard, taking normalization
  2973. into account.</p>
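<p>Caseless matching that takes normalization into account can be sketched with the
Python standard library. The example assumes the form of canonical caseless matching
in which each string is normalized to NFD, case folded, and normalized to NFD again
before comparison, and it relies on <code>str.casefold()</code> as the case folding
operation. The function names are hypothetical.</p>

```python
import unicodedata


def nfd(text):
    """Return the NFD normalization of text."""
    return unicodedata.normalize("NFD", text)


def canonical_caseless_match(x, y):
    """Compare two strings as NFD(casefold(NFD(s))) on each side."""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())
```

<p>Under this comparison, U+00C5 matches the sequence U+0061 U+030A, although the two
strings differ both in case and in normalization form.</p>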
  2974. <p><i>Section 4.2, Case</i> in
  2975. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  2976. introduces case and case mapping properties. <i>Table 4-3, Sources
  2977. for Case Mapping Information</i>
  2978. in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] describes the kind of case-related
  2979. information that is available in various data files of the UCD.
  2980. <i>Table 11</i> lists those data files again, giving the
  2981. explicit list of case-related properties defined in each.
  2982. The link on each property leads to its description in
  2983. <i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.</p>
  2984. <p class="caption">Table 11. <a name="Case_Properties_Table" href="#Case_Properties_Table">UCD Files and Case Properties</a></p>
  2985. <div align="center">
  2986. <table class="simple">
  2987. <tr>
  2988. <th>File Name</th>
  2989. <th>Case Properties</th>
  2990. </tr>
  2991. <tr>
  2992. <td>UnicodeData.txt</td>
  2993. <td><a href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a>,
  2994. <a href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a>,
  2995. <a href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
  2996. </tr>
  2997. <tr>
  2998. <td>SpecialCasing.txt</td>
  2999. <td><a href="#Uppercase_Mapping">Uppercase_Mapping</a>,
  3000. <a href="#Lowercase_Mapping">Lowercase_Mapping</a>,
  3001. <a href="#Titlecase_Mapping">Titlecase_Mapping</a></td>
  3002. </tr>
  3003. <tr>
  3004. <td>CaseFolding.txt</td>
  3005. <td><a href="#Simple_Case_Folding">Simple_Case_Folding</a>,
  3006. <a href="#Case_Folding">Case_Folding</a></td>
  3007. </tr>
  3008. <tr>
  3009. <td>DerivedCoreProperties.txt</td>
  3010. <td><a href="#Uppercase">Uppercase</a>,
  3011. <a href="#Lowercase">Lowercase</a>,
  3012. <a href="#Cased">Cased</a>,
  3013. <a href="#Case_Ignorable">Case_Ignorable</a>,
  3014. <a href="#CWL">Changes_When_Lowercased</a>,
  3015. <a href="#CWU">Changes_When_Uppercased</a>,
  3016. <a href="#CWT">Changes_When_Titlecased</a>,
  3017. <a href="#CWCF">Changes_When_Casefolded</a>,
  3018. <a href="#CWCM">Changes_When_Casemapped</a>
  3019. </td>
  3020. </tr>
  3021. <tr>
  3022. <td>DerivedNormalizationProps.txt</td>
  3023. <td><a href="#NFKC_Casefold">NFKC_Casefold</a>,
  3024. <a href="#CWKCF">Changes_When_NFKC_Casefolded</a></td>
  3025. </tr>
  3026. <tr>
  3027. <td>PropList.txt</td>
  3028. <td><a href="#Soft_Dotted">Soft_Dotted</a>,
  3029. <a href="#Other_Uppercase">Other_Uppercase</a>,
  3030. <a href="#Other_Lowercase">Other_Lowercase</a></td>
  3031. </tr>
  3032. </table>
  3033. </div>
  3034. <p>For compatibility with existing parsers, UnicodeData.txt only
  3035. contains case mappings for characters where they constitute one-to-one mappings;
  3036. it also omits
  3037. information about context-sensitive case mappings. Information about
  3038. these special cases can be found in the separate data file,
  3039. SpecialCasing.txt, expressed as separate properties.</p>
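<p>The difference between the one-to-one mappings in UnicodeData.txt and the full
mappings is visible for U+00DF LATIN SMALL LETTER SHARP S. In the sketch below, the
sample record is reproduced from memory and should be checked against the actual data
file, and the function name is hypothetical. Python's <code>str.upper()</code> is used
as an example of an implementation that applies full case mappings.</p>

```python
# Sample record (assumed to match UnicodeData.txt; verify before relying on it).
SHARP_S = "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;"


def simple_uppercase(record):
    """Return field 12 (Simple_Uppercase_Mapping) as a code point, or None."""
    value = record.split(";")[12]
    return int(value, 16) if value else None


# Full uppercase mapping as applied by Python's built-in string method.
full_upper = "\u00df".upper()
```

<p>The record carries no simple uppercase mapping, while the full mapping produces the
two-character string "SS".</p>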
  3040. <p><i>Section 5.18, Case Mappings</i>, in
  3041. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>]
  3042. discusses various implementation issues for handling case,
  3043. including language-specific case mapping, as for Greek and
  3044. for Turkish. That section also describes case folding in particular detail.</p>
  3045. <p>The special casing conditions associated with case mapping for Greek,
  3046. Turkish, and Lithuanian are specified in an additional field in
  3047. <a href="#SpecialCasing.txt">SpecialCasing.txt</a>. For example, the
  3048. lowercase mapping for sigma in Greek varies according to its position
  3049. in a word. The condition list does not constitute a formal character
  3050. property in the UCD, because it is a statement about the context of occurrence
  3051. of casing behavior for a character or characters, rather than a semantic
  3052. attribute of those characters. Versions of the UCD from
  3053. Version 3.2.0 to Version 5.0.0 <i>did</i> list property aliases
  3054. for Special_Case_Condition (scc), but this was determined to be an error
  3055. when the UCD was analyzed for representation in XML; consequently,
  3056. the Special_Case_Condition property aliases were removed as of Version 5.1.0.</p>
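<p>The condition list can be read as an optional fifth field of a SpecialCasing.txt record. The following is a minimal parsing sketch, assuming the record layout <code>code; lower; title; upper; (conditions;)? # comment</code>. The function name <code>parse_special_casing</code> and the two sample records are illustrative assumptions and should be verified against the actual file.</p>

```python
def _to_string(field: str) -> str:
    """Convert a space-separated list of hex code points to a string."""
    return "".join(chr(int(token, 16)) for token in field.split())


def parse_special_casing(line: str) -> dict:
    """Parse one SpecialCasing.txt record (sketch; no error handling)."""
    data = line.split("#", 1)[0]              # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    if fields and fields[-1] == "":           # text after the last ';'
        fields.pop()
    code, lower, title, upper = fields[:4]
    conditions = fields[4].split() if len(fields) > 4 else []
    return {
        "code": int(code, 16),
        "lower": _to_string(lower),
        "title": _to_string(title),
        "upper": _to_string(upper),
        "conditions": conditions,             # context, not a property
    }


# Assumed sample records, in the style of SpecialCasing.txt.
sigma = parse_special_casing(
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA")
sharp_s = parse_special_casing(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S")
print(sigma["conditions"], sharp_s["conditions"])
```

<p>Keeping the conditions as a separate list, rather than as a property value of the character, mirrors the distinction drawn above: the list describes a context of occurrence.</p>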
  3057. <p>Caseless matching is of particular concern for a number of text
  3058. processing algorithms, so it is also discussed at some length
  3059. in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax"
  3060. [<a href="../tr41/tr41-21.html#UAX31">UAX31</a>] and
  3061. in Unicode Technical Standard #10, "Unicode Collation Algorithm"
  3062. [<a href="../tr41/tr41-21.html#UTS10">UTS10</a>].</p>
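<p>A minimal sketch of caseless matching follows, assuming that Python's <code>str.casefold()</code> implements full case folding. The helper name <code>caseless_equal</code> is hypothetical. The sketch does not normalize its input, so it does not cover canonical or compatibility caseless matching.</p>

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding (no normalization)."""
    return a.casefold() == b.casefold()


print(caseless_equal("Stra\u00dfe", "STRASSE"))
# Final and non-final sigma fold to the same character.
print(caseless_equal("\u03a3\u0391\u03a3", "\u03c3\u03b1\u03c2"))
```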
  3063. <p>Further information about locale-specific casing conventions
  3064. can be found in the Unicode Common Locale Data Repository
  3065. [<a href="../tr41/tr41-21.html#CLDR">CLDR</a>].</p>
  3066. <h3>5.7 <a name="Property_Values" href="#Property_Values">Property Value Lists</a></h3>
  3067. <p>The following subsections give summaries of property values for certain
  3068. Enumeration properties. Other property values
  3069. are documented in other, topically-specific annexes; for example,
  3070. the Line_Break property values are documented in
  3071. Unicode Standard Annex #14, "Unicode Line Breaking Algorithm"
  3072. [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>] and the
  3073. various segmentation-related property values are documented in
  3074. Unicode Standard Annex #29, "Unicode Text Segmentation"
  3075. [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</p>
  3076. <h4>5.7.1 <a name="General_Category_Values" href="#General_Category_Values">General Category Values</a></h4>
  3077. <p>The General_Category property of a code point provides for the
  3078. most general classification of that code point. It is usually
  3079. determined based on the primary characteristic of the assigned
  3080. character for that code point. For example, is the character a letter,
  3081. a mark, a number, punctuation, or a symbol, and if so, of what
  3082. type? Other General_Category values define the classification of
  3083. code points which are not assigned to regular graphic characters,
  3084. including such statuses as private-use, control, surrogate code
  3085. point, and reserved unassigned.</p>
  3086. <p>Many characters have multiple uses, and not all such cases
  3087. can be captured entirely by the General_Category value. For example,
  3088. the General_Category value of Latin, Greek, or Hebrew letters does not
  3089. attempt to cover (or preclude) the numerical use of such letters
  3090. as Roman numerals or in other numerary systems. Conversely, the
  3091. General_Category of ASCII digits 0..9 as Nd (decimal digit)
  3092. does not attempt to cover (or preclude) the occasional use of
  3093. these digits as letters in various orthographies. The General_Category
  3094. is simply the first-order, most usual categorization of a
  3095. character.</p>
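<p>As a quick illustration, General_Category values can be queried with Python's standard <code>unicodedata</code> module. This reflects whichever UCD version the running interpreter bundles, not necessarily this version of the UCD. The sample characters are chosen for illustration only.</p>

```python
import unicodedata

SAMPLES = {
    "V": "Lu",       # a Latin letter, even when used as a Roman numeral
    "a": "Ll",
    "\u01c5": "Lt",  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    "5": "Nd",       # ASCII digit
    "\u2167": "Nl",  # ROMAN NUMERAL EIGHT
    "\u00bd": "No",  # VULGAR FRACTION ONE HALF
}

for ch, expected in SAMPLES.items():
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), expected)
```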
  3096. <p>For more information about the General_Category
  3097. property, see <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  3098. <p>The values in the General_Category field in UnicodeData.txt
  3099. make use of the short, abbreviated property value aliases
  3100. for General_Category. For convenience in reference, <i>Table 12</i>
  3101. lists all the abbreviated and long value aliases for General_Category values, reproduced from
  3102. <a href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a>,
  3103. along with a brief description of each category.</p>
  3104. <p class="caption">Table 12. <a name="GC_Values_Table" href="#GC_Values_Table">General_Category Values</a></p>
  3105. <div align="center">
  3106. <table class="simple">
  3107. <tr>
  3108. <th>Abbr</th>
  3109. <th>Long</th>
  3110. <th>Description</th>
  3111. </tr>
  3112. <tr>
  3113. <td>Lu</td>
  3114. <td>Uppercase_Letter</td>
  3115. <td>an uppercase letter</td>
  3116. </tr>
  3117. <tr>
  3118. <td>Ll</td>
  3119. <td>Lowercase_Letter</td>
  3120. <td>a lowercase letter</td>
  3121. </tr>
  3122. <tr>
  3123. <td>Lt</td>
  3124. <td>Titlecase_Letter</td>
  3125. <td>a digraphic character, with first part uppercase</td>
  3126. </tr>
  3127. <tr class="lightblue">
  3128. <td>LC</td>
  3129. <td>Cased_Letter</td>
  3130. <td>Lu | Ll | Lt</td>
  3131. </tr>
  3132. <tr>
  3133. <td>Lm</td>
  3134. <td>Modifier_Letter</td>
  3135. <td>a modifier letter</td>
  3136. </tr>
  3137. <tr>
  3138. <td>Lo</td>
  3139. <td>Other_Letter</td>
  3140. <td>other letters, including syllables and ideographs</td>
  3141. </tr>
  3142. <tr class="lightblue">
  3143. <td>L</td>
  3144. <td>Letter</td>
  3145. <td>Lu | Ll | Lt | Lm | Lo</td>
  3146. </tr>
  3147. <tr>
  3148. <td>Mn</td>
  3149. <td>Nonspacing_Mark</td>
  3150. <td>a nonspacing combining mark (zero advance width)</td>
  3151. </tr>
  3152. <tr>
  3153. <td>Mc</td>
  3154. <td>Spacing_Mark</td>
  3155. <td>a spacing combining mark (positive advance width)</td>
  3156. </tr>
  3157. <tr>
  3158. <td>Me</td>
  3159. <td>Enclosing_Mark</td>
  3160. <td>an enclosing combining mark</td>
  3161. </tr>
  3162. <tr class="lightblue">
  3163. <td>M</td>
  3164. <td>Mark</td>
  3165. <td>Mn | Mc | Me</td>
  3166. </tr>
  3167. <tr>
  3168. <td>Nd</td>
  3169. <td>Decimal_Number</td>
  3170. <td>a decimal digit</td>
  3171. </tr>
  3172. <tr>
  3173. <td>Nl</td>
  3174. <td>Letter_Number</td>
  3175. <td>a letterlike numeric character</td>
  3176. </tr>
  3177. <tr>
  3178. <td>No</td>
  3179. <td>Other_Number</td>
  3180. <td>a numeric character of other type</td>
  3181. </tr>
  3182. <tr class="lightblue">
  3183. <td>N</td>
  3184. <td>Number</td>
  3185. <td>Nd | Nl | No</td>
  3186. </tr>
  3187. <tr>
  3188. <td>Pc</td>
  3189. <td>Connector_Punctuation</td>
  3190. <td>a connecting punctuation mark, like a tie</td>
  3191. </tr>
  3192. <tr>
  3193. <td>Pd</td>
  3194. <td>Dash_Punctuation</td>
  3195. <td>a dash or hyphen punctuation mark</td>
  3196. </tr>
  3197. <tr>
  3198. <td>Ps</td>
  3199. <td>Open_Punctuation</td>
  3200. <td>an opening punctuation mark (of a pair)</td>
  3201. </tr>
  3202. <tr>
  3203. <td>Pe</td>
  3204. <td>Close_Punctuation</td>
  3205. <td>a closing punctuation mark (of a pair)</td>
  3206. </tr>
  3207. <tr>
  3208. <td>Pi</td>
  3209. <td>Initial_Punctuation</td>
  3210. <td>an initial quotation mark</td>
  3211. </tr>
  3212. <tr>
  3213. <td>Pf</td>
  3214. <td>Final_Punctuation</td>
  3215. <td>a final quotation mark</td>
  3216. </tr>
  3217. <tr>
  3218. <td>Po</td>
  3219. <td>Other_Punctuation</td>
  3220. <td>a punctuation mark of other type</td>
  3221. </tr>
  3222. <tr class="lightblue">
  3223. <td>P</td>
  3224. <td>Punctuation</td>
  3225. <td>Pc | Pd | Ps | Pe | Pi | Pf | Po</td>
  3226. </tr>
  3227. <tr>
  3228. <td>Sm</td>
  3229. <td>Math_Symbol</td>
  3230. <td>a symbol of mathematical use</td>
  3231. </tr>
  3232. <tr>
  3233. <td>Sc</td>
  3234. <td>Currency_Symbol</td>
  3235. <td>a currency sign</td>
  3236. </tr>
  3237. <tr>
  3238. <td>Sk</td>
  3239. <td>Modifier_Symbol</td>
  3240. <td>a non-letterlike modifier symbol</td>
  3241. </tr>
  3242. <tr>
  3243. <td>So</td>
  3244. <td>Other_Symbol</td>
  3245. <td>a symbol of other type</td>
  3246. </tr>
  3247. <tr class="lightblue">
  3248. <td>S</td>
  3249. <td>Symbol</td>
  3250. <td>Sm | Sc | Sk | So</td>
  3251. </tr>
  3252. <tr>
  3253. <td>Zs</td>
  3254. <td>Space_Separator</td>
  3255. <td>a space character (of various non-zero widths)</td>
  3256. </tr>
  3257. <tr>
  3258. <td>Zl</td>
  3259. <td>Line_Separator</td>
  3260. <td>U+2028 LINE SEPARATOR only</td>
  3261. </tr>
  3262. <tr>
  3263. <td>Zp</td>
  3264. <td>Paragraph_Separator</td>
  3265. <td>U+2029 PARAGRAPH SEPARATOR only</td>
  3266. </tr>
  3267. <tr class="lightblue">
  3268. <td>Z</td>
  3269. <td>Separator</td>
  3270. <td>Zs | Zl | Zp</td>
  3271. </tr>
  3272. <tr>
  3273. <td>Cc</td>
  3274. <td>Control</td>
  3275. <td>a C0 or C1 control code</td>
  3276. </tr>
  3277. <tr>
  3278. <td>Cf</td>
  3279. <td>Format</td>
  3280. <td>a format control character</td>
  3281. </tr>
  3282. <tr>
  3283. <td>Cs</td>
  3284. <td>Surrogate</td>
  3285. <td>a surrogate code point</td>
  3286. </tr>
  3287. <tr>
  3288. <td>Co</td>
  3289. <td>Private_Use</td>
  3290. <td>a private-use character</td>
  3291. </tr>
  3292. <tr>
  3293. <td>Cn</td>
  3294. <td>Unassigned</td>
  3295. <td>a reserved unassigned code point or a noncharacter</td>
  3296. </tr>
  3297. <tr class="lightblue">
  3298. <td>C</td>
  3299. <td>Other</td>
  3300. <td>Cc | Cf | Cs | Co | Cn</td>
  3301. </tr>
  3302. </table>
  3303. </div>
  3304. <p>Note that the value gc=Cn does not actually
  3305. occur in UnicodeData.txt, because that data file does not list
  3306. unassigned code points.</p>
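<p>One consequence for parsers is that a lookup table built from UnicodeData.txt needs an explicit fallback for code points the file does not list. The sketch below uses a tiny hypothetical table, <code>LISTED</code>, in place of a parsed file, and assumes that any ranges in the file have already been expanded.</p>

```python
import unicodedata

# Hypothetical stand-in for a table parsed from UnicodeData.txt.
LISTED = {0x0041: "Lu", 0x0061: "Ll", 0x0030: "Nd"}


def general_category(code_point: int, listed=LISTED) -> str:
    """Return the listed value, or Cn for code points absent from the file."""
    return listed.get(code_point, "Cn")


print(general_category(0x0041), general_category(0x0378))
# The bundled database reports Cn for a noncharacter as well.
print(unicodedata.category("\uffff"))
```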
  3307. <p>The distinctions between some General_Category values
  3308. are somewhat arbitrary for edge cases, particularly those involving
  3309. symbols and punctuation. For example, a number of multiple-function
  3310. ASCII characters, including "@", "#", "%", and "&amp;", have long
  3311. been classified as Other_Punctuation (gc=Po), although they
  3312. are not among the characters used as punctuation marks in traditional
  3313. Western typography. Other characters may also be ambiguous between
  3314. functioning to organize and delimit textual units (punctuation-like)
  3315. or to represent concepts (symbol-like). Likewise, it may not always
  3316. be clear whether some symbols are primarily used for mathematics
  3317. or whether they are general symbols with occasional or even common use in mathematics.
  3318. For example, many arrow symbols are classed as Other_Symbol,
  3319. although they are widely used in mathematics. The
  3320. General_Category values constitute a rough partitioning of characters
  3321. to make distinctions for algorithmic processing, but do not
  3322. provide a definitive classification for such overlapping
  3323. or ambiguous usage of characters.</p>
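<p>The edge cases mentioned above can be inspected directly. The sketch below queries the <code>unicodedata</code> module bundled with Python. The choice of arrows is illustrative: U+2195 is taken as an example of an arrow classed as Other_Symbol, and U+2192 as one classed as Math_Symbol.</p>

```python
import unicodedata

for ch in "@#%&":
    # Multiple-function ASCII characters classified as Other_Punctuation.
    print(ch, unicodedata.category(ch))

up_down_arrow = unicodedata.category("\u2195")     # UP DOWN ARROW
rightwards_arrow = unicodedata.category("\u2192")  # RIGHTWARDS ARROW
print(up_down_arrow, rightwards_arrow)
```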
  3324. <p>Characters with the quotation-related General_Category values
  3325. Pi or Pf may behave like opening punctuation (gc=Ps) or closing
  3326. punctuation (gc=Pe), depending on usage and quotation conventions.</p>
  3327. <p>General_Category values in the table highlighted
  3328. in light blue (LC, L, M, N, P, S, Z, C) stand for groupings of related
  3329. General_Category values. The classes they represent can be derived by
  3330. unions of the relevant simple values, as shown in the table. The abbreviated
  3331. and long value aliases for these classes are provided as a convenience
  3332. for implementations, such as regular expression engines, which may wish to match more generic
  3333. categories, such as "letter" or "number", rather than the detailed
  3334. subtypes for General_Category. These aliases for groupings
  3335. of General_Category values do not occur in UnicodeData.txt, which instead
  3336. always specifies the enumerated subtype for the General_Category of a character.</p>
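<p>The grouping values can be implemented as unions over the enumerated subtypes. The following sketch transcribes the groupings from the table above. The names <code>GC_GROUPS</code> and <code>matches_gc</code> are hypothetical.</p>

```python
import unicodedata

# Grouping aliases expressed as unions of the simple values.
GC_GROUPS = {
    "LC": {"Lu", "Ll", "Lt"},
    "L": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "M": {"Mn", "Mc", "Me"},
    "N": {"Nd", "Nl", "No"},
    "P": {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"},
    "S": {"Sm", "Sc", "Sk", "So"},
    "Z": {"Zs", "Zl", "Zp"},
    "C": {"Cc", "Cf", "Cs", "Co", "Cn"},
}


def matches_gc(ch: str, value: str) -> bool:
    """True if ch has the given General_Category value or grouping."""
    gc = unicodedata.category(ch)
    return gc == value or gc in GC_GROUPS.get(value, ())


print(matches_gc("A", "L"), matches_gc("\u05d0", "LC"))
```

<p>Because the database itself stores only the enumerated subtype, the grouping is resolved at match time rather than at parse time.</p>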
  3337. <p>The symbol &quot;L&amp;&quot; is a label used to stand for any
  3338. combination of uppercase, lowercase or titlecase letters
  3339. (Lu, Ll, or Lt), in the first part of comments in the data files of the UCD.
  3340. It is equivalent to gc=LC, but is only a label in comments, and is
  3341. not expected to be used as an identifier for regular expression matching.</p>
  3342. <p>The Unicode Standard does not assign nondefault property
  3343. values to control characters (gc=Cc), except
  3344. for certain well-defined exceptions involving the Unicode Bidirectional Algorithm,
  3345. the Unicode Line Breaking Algorithm, and Unicode Text Segmentation.
  3346. Also, implementations will usually assign
  3347. behavior to certain line breaking control
  3348. characters&#x2014;most notably U+000D and U+000A (CR and LF)&#x2014;according to platform conventions.
  3349. See <i>Section 5.8, Newline Guidelines</i> in
  3350. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] for more information.</p>
  3351. <h4>5.7.2 <a name="Bidi_Class_Values" href="#Bidi_Class_Values">Bidirectional Class Values</a></h4>
  3352. <p>The values in the Bidi_Class field in UnicodeData.txt
  3353. make use of the short, abbreviated property value aliases
  3354. for Bidi_Class. For convenience in reference, <i>Table 13</i>
  3355. lists all the abbreviated and long value aliases for Bidi_Class values, reproduced from
  3356. <a href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a>,
  3357. along with a brief description of each category.</p>
  3358. <p class="caption">Table 13. <a name="BC_Values_Table" href="#BC_Values_Table">Bidi_Class Values</a></p>
  3359. <div align="center">
  3360. <table class="simple">
  3361. <tr>
  3362. <th>Abbr</th>
  3363. <th>Long</th>
  3364. <th>Description</th>
  3365. </tr>
  3366. <tr class="lightblue">
  3367. <td colspan="3" align="center">Strong Types</td>
  3368. </tr>
  3369. <tr>
  3370. <td>L</td>
  3371. <td>Left_To_Right</td>
  3372. <td>any strong left-to-right character</td>
  3373. </tr>
  3374. <tr>
  3375. <td>R</td>
  3376. <td>Right_To_Left</td>
  3377. <td>any strong right-to-left (non-Arabic-type) character</td>
  3378. </tr>
  3379. <tr>
  3380. <td>AL</td>
  3381. <td>Arabic_Letter</td>
  3382. <td>any strong right-to-left (Arabic-type) character</td>
  3383. </tr>
  3384. <tr class="lightblue">
  3385. <td colspan="3" align="center">Weak Types</td>
  3386. </tr>
  3387. <tr>
  3388. <td>EN</td>
  3389. <td>European_Number</td>
  3390. <td>any ASCII digit or Eastern Arabic-Indic digit</td>
  3391. </tr>
  3392. <tr>
  3393. <td>ES</td>
  3394. <td>European_Separator</td>
  3395. <td>plus and minus signs</td>
  3396. </tr>
  3397. <tr>
  3398. <td>ET</td>
  3399. <td>European_Terminator</td>
  3400. <td>a terminator in a numeric format context, includes currency signs</td>
  3401. </tr>
  3402. <tr>
  3403. <td>AN</td>
  3404. <td>Arabic_Number</td>
  3405. <td>any Arabic-Indic digit</td>
  3406. </tr>
  3407. <tr>
  3408. <td>CS</td>
  3409. <td>Common_Separator</td>
  3410. <td>commas, colons, and slashes</td>
  3411. </tr>
  3412. <tr>
  3413. <td>NSM</td>
  3414. <td>Nonspacing_Mark</td>
  3415. <td>any nonspacing mark</td>
  3416. </tr>
  3417. <tr>
  3418. <td>BN</td>
  3419. <td>Boundary_Neutral</td>
  3420. <td>most format characters, control codes, or noncharacters</td>
  3421. </tr>
  3422. <tr class="lightblue">
  3423. <td colspan="3" align="center">Neutral Types</td>
  3424. </tr>
  3425. <tr>
  3426. <td>B</td>
  3427. <td>Paragraph_Separator</td>
  3428. <td>various newline characters</td>
  3429. </tr>
  3430. <tr>
  3431. <td>S</td>
  3432. <td>Segment_Separator</td>
  3433. <td>various segment-related control codes</td>
  3434. </tr>
  3435. <tr>
  3436. <td>WS</td>
  3437. <td>White_Space</td>
  3438. <td>spaces</td>
  3439. </tr>
  3440. <tr>
  3441. <td>ON</td>
  3442. <td>Other_Neutral</td>
  3443. <td>most other symbols and punctuation marks</td>
  3444. </tr>
  3445. <tr class="lightblue">
  3446. <td colspan="3" align="center">Explicit Formatting Types</td>
  3447. </tr>
  3448. <tr>
  3449. <td>LRE</td>
  3450. <td>Left_To_Right_Embedding</td>
  3451. <td>U+202A: the LR embedding control</td>
  3452. </tr>
  3453. <tr>
  3454. <td>LRO</td>
  3455. <td>Left_To_Right_Override</td>
  3456. <td>U+202D: the LR override control</td>
  3457. </tr>
  3458. <tr>
  3459. <td>RLE</td>
  3460. <td>Right_To_Left_Embedding</td>
  3461. <td>U+202B: the RL embedding control</td>
  3462. </tr>
  3463. <tr>
  3464. <td>RLO</td>
  3465. <td>Right_To_Left_Override</td>
  3466. <td>U+202E: the RL override control</td>
  3467. </tr>
  3468. <tr>
  3469. <td>PDF</td>
  3470. <td>Pop_Directional_Format</td>
  3471. <td>U+202C: terminates an embedding or override control</td>
  3472. </tr>
  3473. <tr>
  3474. <td>LRI</td>
  3475. <td>Left_To_Right_Isolate</td>
  3476. <td>U+2066: the LR isolate control</td>
  3477. </tr>
  3478. <tr>
  3479. <td>RLI</td>
  3480. <td>Right_To_Left_Isolate</td>
  3481. <td>U+2067: the RL isolate control</td>
  3482. </tr>
  3483. <tr>
  3484. <td>FSI</td>
  3485. <td>First_Strong_Isolate</td>
  3486. <td>U+2068: the first strong isolate control</td>
  3487. </tr>
  3488. <tr>
  3489. <td>PDI</td>
  3490. <td>Pop_Directional_Isolate</td>
  3491. <td>U+2069: terminates an isolate control</td>
  3492. </tr>
  3493. </table>
  3494. </div>
  3495. <p>Please refer to Unicode Standard Annex #9, "Unicode Bidirectional Algorithm"
  3496. [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>] for
  3497. an explanation of the significance
  3498. of these values when formatting bidirectional text.</p>
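<p>For illustration, Bidi_Class values can be queried with <code>unicodedata.bidirectional()</code> from the Python standard library, which returns the abbreviated aliases. The sample characters below are illustrative and reflect the UCD version bundled with the interpreter.</p>

```python
import unicodedata

SAMPLES = {
    "A": "L",
    "\u05d0": "R",     # HEBREW LETTER ALEF
    "\u0627": "AL",    # ARABIC LETTER ALEF
    "1": "EN",
    "\u0661": "AN",    # ARABIC-INDIC DIGIT ONE
    "+": "ES",
    "$": "ET",
    ",": "CS",
    "\u0300": "NSM",   # COMBINING GRAVE ACCENT
    " ": "WS",
    "!": "ON",
    "\u202a": "LRE",
}

for ch, expected in SAMPLES.items():
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), expected)
```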
  3499. <p>The four enumerated values for the isolate controls were added
  3500. in Unicode 6.3. That means there is a discontinuity in the enumeration for Bidi_Class
  3501. between Unicode 6.2 and Unicode 6.3 (and later versions) which parsers of
  3502. UnicodeData.txt and DerivedBidiClass.txt must take into account.</p>
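<p>One way a parser might account for this discontinuity is to validate the Bidi_Class field against a version-dependent set of values. The sketch below is an assumption about how such a check could look. The names <code>known_bidi_classes</code> and <code>is_valid_bidi_class</code> are hypothetical.</p>

```python
# Values defined through Unicode 6.2, transcribed from the table above.
BIDI_CLASSES_6_2 = frozenset(
    "L R AL EN ES ET AN CS NSM BN B S WS ON LRE LRO RLE RLO PDF".split())
# The four isolate controls added in Unicode 6.3.
ISOLATES_6_3 = frozenset("LRI RLI FSI PDI".split())


def known_bidi_classes(version):
    """Return the Bidi_Class values valid for a (major, minor) version."""
    if version >= (6, 3):
        return BIDI_CLASSES_6_2 | ISOLATES_6_3
    return BIDI_CLASSES_6_2


def is_valid_bidi_class(value, version):
    return value in known_bidi_classes(version)


print(is_valid_bidi_class("LRI", (6, 2)), is_valid_bidi_class("LRI", (6, 3)))
```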
  3503. <h4>5.7.3 <a name="Character_Decomposition_Mappings" href="#Character_Decomposition_Mappings">Character Decomposition Mapping</a></h4>
  3504. <p>The value of the Decomposition_Mapping property for a character is provided
  3505. in field 5 of UnicodeData.txt. This is a string property, consisting of a sequence
  3506. of one or more Unicode code points. The default value of the Decomposition_Mapping
  3507. property is the code point of the character itself. The use of the default value
  3508. for a character is indicated by leaving field 5 empty in UnicodeData.txt.
  3509. Informally, the value of the Decomposition_Mapping property for a character
  3510. is known simply as its <i>decomposition mapping</i>. When a character's decomposition
  3511. mapping is other than the default value, the decomposition mapping is printed out
  3512. explicitly in the names list for the Unicode code charts.</p>
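<p>The treatment of an empty field 5 can be sketched as follows. The function name <code>decomposition_mapping_field</code> is hypothetical, and the two sample records are quoted from memory in the format of UnicodeData.txt. Formatting tags, if present, are simply skipped in this sketch.</p>

```python
def decomposition_mapping_field(record: str) -> list:
    """Return the Decomposition_Mapping code points for one record."""
    fields = record.split(";")
    code_point = int(fields[0], 16)
    raw = fields[5]
    if not raw:
        # Empty field 5: the default value is the character itself.
        return [code_point]
    # Skip a leading formatting tag such as <compat>, if any.
    return [int(tok, 16) for tok in raw.split() if not tok.startswith("<")]


# Assumed sample records.
PLAIN = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
GRAVE = ("00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;"
         "LATIN CAPITAL LETTER A GRAVE;;;00E0;")

print(decomposition_mapping_field(PLAIN))
print(decomposition_mapping_field(GRAVE))
```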
  3513. <p>The prefixed tags supplied with a subset of the decomposition mappings generally indicate formatting
  3514. information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a
  3515. formatting tag also indicates that the mapping is a compatibility mapping and not a canonical
  3516. mapping. In the absence of other formatting information in a compatibility mapping, the generic tag
  3517. &lt;compat&gt; is used to distinguish it from canonical mappings.</p>
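<p>A short sketch of separating the tag from the mapping is given below. It relies on <code>unicodedata.decomposition()</code> from the Python standard library, which returns the raw field value including any tag. The helper name <code>split_decomposition</code> is hypothetical.</p>

```python
import unicodedata


def split_decomposition(ch: str):
    """Return (tag, code_points); tag is None for a canonical mapping."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, [ord(ch)]        # default value: the character itself
    tokens = raw.split()
    tag = None
    if tokens[0].startswith("<"):     # a tag marks a compatibility mapping
        tag = tokens[0]
        tokens = tokens[1:]
    return tag, [int(tok, 16) for tok in tokens]


print(split_decomposition("\u00c0"))  # canonical, no tag
print(split_decomposition("\u00a0"))  # NO-BREAK SPACE
print(split_decomposition("\u00bd"))  # VULGAR FRACTION ONE HALF
```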
  3518. <p>In some instances a canonical mapping or a compatibility mapping may consist of a single
  3519. character. For a canonical mapping, this indicates that the character is a canonical equivalent of
  3520. another single character. For a compatibility mapping, this indicates that the character is a
  3521. compatibility equivalent of another single character.</p>
  3522. <p>A canonical mapping may also consist of a pair of characters, but is never
  3523. longer than two characters. When a canonical mapping consists of a pair of characters,
  3524. the first character may itself be a character with a decomposition mapping, but the
  3525. second character never has a decomposition mapping.</p>
  3526. <p>Compatibility mappings can be much longer than canonical mappings. For historical reasons, the
  3527. longest compatibility mapping is 18 characters long. Compatibility mappings are guaranteed
  3528. to be no longer than 18 characters, although most consist of just a few characters.</p>
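<p>The length limits for canonical and compatibility mappings can be checked empirically against the database bundled with Python. This is only a sanity check under that assumption, not a normative statement. The function name <code>longest_mappings</code> is hypothetical.</p>

```python
import sys
import unicodedata


def longest_mappings():
    """Scan all code points; return (max canonical, max compatibility)."""
    longest_canonical = 0
    longest_compat = 0
    for cp in range(sys.maxunicode + 1):
        raw = unicodedata.decomposition(chr(cp))
        if not raw:
            continue
        tokens = raw.split()
        if tokens[0].startswith("<"):
            # Do not count the formatting tag itself.
            longest_compat = max(longest_compat, len(tokens) - 1)
        else:
            longest_canonical = max(longest_canonical, len(tokens))
    return longest_canonical, longest_compat


print(longest_mappings())
```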
  3529. <p>The compatibility formatting
  3530. tags used in the UCD are listed in <i>Table 14</i>.</p>
  3531. <p class="caption">Table 14. <a name="Formatting_Tags_Table" href="#Formatting_Tags_Table">Compatibility Formatting Tags</a></p>
  3532. <div align="center">
  3533. <table class="simple">
  3534. <tr>
  3535. <th>Tag</th>
  3536. <th>Description</th>
  3537. </tr>
  3538. <tr>
  3539. <td>&lt;font&gt;</td>
  3540. <td>Font variant (for example, a blackletter form)</td>
  3541. </tr>
  3542. <tr>
  3543. <td>&lt;noBreak&gt;</td>
  3544. <td>No-break version of a space or hyphen</td>
  3545. </tr>
  3546. <tr>
  3547. <td>&lt;initial&gt;</td>
  3548. <td>Initial presentation form (Arabic)</td>
  3549. </tr>
  3550. <tr>
  3551. <td>&lt;medial&gt;</td>
  3552. <td>Medial presentation form (Arabic)</td>
  3553. </tr>
  3554. <tr>
  3555. <td>&lt;final&gt;</td>
  3556. <td>Final presentation form (Arabic)</td>
  3557. </tr>
  3558. <tr>
  3559. <td>&lt;isolated&gt;</td>
  3560. <td>Isolated presentation form (Arabic)</td>
  3561. </tr>
  3562. <tr>
  3563. <td>&lt;circle&gt;</td>
  3564. <td>Encircled form</td>
  3565. </tr>
  3566. <tr>
  3567. <td>&lt;super&gt;</td>
  3568. <td>Superscript form</td>
  3569. </tr>
  3570. <tr>
  3571. <td>&lt;sub&gt;</td>
  3572. <td>Subscript form</td>
  3573. </tr>
  3574. <tr>
  3575. <td>&lt;vertical&gt;</td>
  3576. <td>Vertical layout presentation form</td>
  3577. </tr>
  3578. <tr>
  3579. <td>&lt;wide&gt;</td>
  3580. <td>Wide (or zenkaku) compatibility character</td>
  3581. </tr>
  3582. <tr>
  3583. <td>&lt;narrow&gt;</td>
  3584. <td>Narrow (or hankaku) compatibility character</td>
  3585. </tr>
  3586. <tr>
  3587. <td>&lt;small&gt;</td>
  3588. <td>Small variant form (CNS compatibility)</td>
  3589. </tr>
  3590. <tr>
  3591. <td>&lt;square&gt;</td>
  3592. <td>CJK squared font variant</td>
  3593. </tr>
  3594. <tr>
  3595. <td>&lt;fraction&gt;</td>
  3596. <td>Vulgar fraction form</td>
  3597. </tr>
  3598. <tr>
  3599. <td>&lt;compat&gt;</td>
  3600. <td>Otherwise unspecified compatibility character</td>
  3601. </tr>
  3602. </table>
  3603. </div>
  3604. <p><b>Note: </b>There is a difference between decomposition and the
  3605. Decomposition_Mapping property. The
  3606. Decomposition_Mapping property is a string property whose
  3607. values (mappings) are defined in UnicodeData.txt, while the decomposition (also termed &quot;full
  3608. decomposition&quot;) is defined in <i>Section 3.7, Decomposition</i> in
  3609. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>] to use those mappings <i>recursively.</i></p>
  3610. <ul>
  3611. <li>The canonical decomposition is formed by recursively applying the canonical mappings, then
  3612. applying the Canonical Ordering Algorithm.</li>
  3613. <li>The compatibility decomposition is formed by recursively applying the canonical <b>and</b>
  3614. compatibility mappings, then applying the Canonical Ordering Algorithm.</li>
  3615. </ul>
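<p>The two steps of canonical decomposition can be sketched as follows: recursive application of the canonical mappings, then canonical ordering. The sketch uses the Python <code>unicodedata</code> module. It is a simplified illustration, not a conformant implementation: it does not handle Hangul syllables, whose decomposition is algorithmic. The function names are hypothetical.</p>

```python
import unicodedata


def decompose_canonical(ch: str) -> str:
    """Recursively apply canonical Decomposition_Mapping values."""
    raw = unicodedata.decomposition(ch)
    if not raw or raw.startswith("<"):   # no mapping, or compatibility only
        return ch
    return "".join(decompose_canonical(chr(int(tok, 16)))
                   for tok in raw.split())


def canonical_order(text: str) -> str:
    """Stable-sort each run of characters with nonzero combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def nfd_sketch(text: str) -> str:
    return canonical_order("".join(decompose_canonical(c) for c in text))


for sample in ("\u1e69", "\u01d5", "q\u0307\u0323"):
    print([f"{ord(c):04X}" for c in nfd_sketch(sample)])
```

<p>For the non-Hangul samples shown, the result can be compared with <code>unicodedata.normalize("NFD", ...)</code>. The compatibility decomposition would differ only in also following the tagged mappings.</p>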
  3616. <p>Starting from Unicode 2.1.9, the decomposition mappings in
  3617. <a href="#UnicodeData.txt">UnicodeData.txt</a> can be used to derive the
  3618. full decomposition of any single character in canonical order, without
  3619. the need to separately apply the Canonical Ordering Algorithm.
  3620. However, canonical ordering of combining character sequences <b><i>must</i></b> still be applied
  3621. in decomposition when normalizing source text which contains any combining marks.</p>
  3622. <p>The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic
  3623. mapping, as specified in <i>Section 3.12, Conjoining Jamo Behavior</i> in
  3624. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  3625. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but
  3626. effectively it is equivalent to the recursive application of pairwise decomposition
  3627. mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping
  3628. property value for a Hangul syllable is the pairwise decomposition and not the full
  3629. decomposition.</p>
  3630. <p>Each character with the <a href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a>
  3631. value LVT will have a Decomposition_Mapping consisting of a character with an LV value and a
  3632. character with a T value. Thus for U+CE31 the Decomposition_Mapping is &lt;U+CE20, U+11B8&gt;,
  3633. rather than &lt;U+110E, U+1173, U+11B8&gt;.</p>
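<p>The pairwise mapping for the example above can be reproduced with the standard Hangul syllable arithmetic. The sketch below assumes the usual constants (SBase = AC00, LBase = 1100, VBase = 1161, TBase = 11A7, with 19, 21, and 28 as the L, V, and T counts). The function names are hypothetical.</p>

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT        # 588
S_COUNT = L_COUNT * N_COUNT        # 11172


def hangul_decomposition_mapping(cp: int) -> tuple:
    """Pairwise Decomposition_Mapping of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: maps to <LV syllable, trailing consonant>.
        return (cp - t_index, T_BASE + t_index)
    # LV syllable: maps to <leading consonant, vowel>.
    return (L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT)


def hangul_full_decomposition(cp: int) -> tuple:
    """Apply the pairwise mapping recursively."""
    first, second = hangul_decomposition_mapping(cp)
    if S_BASE <= first < S_BASE + S_COUNT:
        return hangul_decomposition_mapping(first) + (second,)
    return (first, second)


print([f"{c:04X}" for c in hangul_decomposition_mapping(0xCE31)])
print([f"{c:04X}" for c in hangul_full_decomposition(0xCE31)])
```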
  3634. <p>The Unihan property kCompatibilityVariant consists of a listing of the
  3635. canonical Decomposition_Mapping property values just for CJK compatibility ideographs. Because its values are
  3636. derived from UnicodeData.txt, it is formally considered to be a derived property. The exact statement
  3637. of the derivation for kCompatibilityVariant is listed in Unicode Standard Annex #38, "Unicode Han Database (Unihan)"
  3638. [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].</p>
  3639. <h4>5.7.4 <a name="Canonical_Combining_Class_Values" href="#Canonical_Combining_Class_Values">Canonical Combining Class Values</a></h4>
  3640. <p>The values in the Canonical_Combining_Class field in UnicodeData.txt
  3641. are numerical values used in the Canonical Ordering Algorithm. Some of
  3642. those numerical values also have explicit symbolic labels as property
  3643. value aliases, to make their intended application more understandable.
  3644. For convenience in reference, <i>Table 15</i>
  3645. lists the long symbolic aliases for Canonical_Combining_Class values, reproduced from
  3646. <a href="#Property_Aliases">PropertyValueAliases.txt</a>,
  3647. along with a brief description of each category. The listing for
  3648. fixed position classes, with long symbolic aliases of the form "Ccc10", and so forth, is
  3649. abbreviated, as when those labels occur they are predictable in form, based on the numeric values.</p>
  3650. <p class="caption">Table 15. <a name="CCC_Values_Table" href="#CCC_Values_Table">Canonical_Combining_Class Values</a></p>
  3651. <div align="center">
  3652. <table class="simple">
  3653. <tr>
  3654. <th>Value</th>
  3655. <th>Long</th>
  3656. <th>Description</th>
  3657. </tr>
  3658. <tr>
  3659. <td>0</td>
  3660. <td>Not_Reordered</td>
  3661. <td>Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing</td>
  3662. </tr>
  3663. <tr>
  3664. <td>1</td>
  3665. <td>Overlay</td>
  3666. <td>Marks which overlay a base letter or symbol</td>
  3667. </tr>
  3668. <tr>
  3669. <td>7</td>
  3670. <td>Nukta</td>
  3671. <td>Diacritic nukta marks in Brahmi-derived scripts</td>
  3672. </tr>
  3673. <tr>
  3674. <td>8</td>
  3675. <td>Kana_Voicing</td>
  3676. <td>Hiragana/Katakana voicing marks</td>
  3677. </tr>
  3678. <tr>
  3679. <td>9</td>
  3680. <td>Virama</td>
  3681. <td>Viramas</td>
  3682. </tr>
  3683. <tr>
  3684. <td>10</td>
  3685. <td>Ccc10</td>
  3686. <td>Start of fixed position classes</td>
  3687. </tr>
  3688. <tr>
  3689. <td>...</td>
  3690. <td>...</td>
  3691. <td>&nbsp;</td>
  3692. </tr>
  3693. <tr>
  3694. <td>199</td>
  3695. <td>&nbsp;</td>
  3696. <td>End of fixed position classes</td>
  3697. </tr>
  3698. <tr>
  3699. <td>200</td>
  3700. <td>Attached_Below_Left</td>
  3701. <td>Marks attached at the bottom left</td>
  3702. </tr>
  3703. <tr>
  3704. <td>202</td>
  3705. <td>Attached_Below</td>
  3706. <td>Marks attached directly below</td>
  3707. </tr>
  3708. <tr>
  3709. <td>204</td>
  3710. <td>&nbsp;</td>
  3711. <td>Marks attached at the bottom right</td>
  3712. </tr>
  3713. <tr>
  3714. <td>208</td>
  3715. <td>&nbsp;</td>
  3716. <td>Marks attached to the left</td>
  3717. </tr>
  3718. <tr>
  3719. <td>210</td>
  3720. <td>&nbsp;</td>
  3721. <td>Marks attached to the right</td>
  3722. </tr>
  3723. <tr>
  3724. <td>212</td>
  3725. <td>&nbsp;</td>
  3726. <td>Marks attached at the top left</td>
  3727. </tr>
  3728. <tr>
  3729. <td>214</td>
  3730. <td>Attached_Above</td>
  3731. <td>Marks attached directly above</td>
  3732. </tr>
  3733. <tr>
  3734. <td>216</td>
  3735. <td>Attached_Above_Right</td>
  3736. <td>Marks attached at the top right</td>
  3737. </tr>
  3738. <tr>
  3739. <td>218</td>
  3740. <td>Below_Left</td>
  3741. <td>Distinct marks at the bottom left</td>
  3742. </tr>
  3743. <tr>
  3744. <td>220</td>
  3745. <td>Below</td>
  3746. <td>Distinct marks directly below</td>
  3747. </tr>
  3748. <tr>
  3749. <td>222</td>
  3750. <td>Below_Right</td>
  3751. <td>Distinct marks at the bottom right</td>
  3752. </tr>
  3753. <tr>
  3754. <td>224</td>
  3755. <td>Left</td>
  3756. <td>Distinct marks to the left</td>
  3757. </tr>
  3758. <tr>
  3759. <td>226</td>
  3760. <td>Right</td>
  3761. <td>Distinct marks to the right</td>
  3762. </tr>
  3763. <tr>
  3764. <td>228</td>
  3765. <td>Above_Left</td>
  3766. <td>Distinct marks at the top left</td>
  3767. </tr>
  3768. <tr>
  3769. <td>230</td>
  3770. <td>Above</td>
  3771. <td>Distinct marks directly above</td>
  3772. </tr>
  3773. <tr>
  3774. <td>232</td>
  3775. <td>Above_Right</td>
  3776. <td>Distinct marks at the top right</td>
  3777. </tr>
  3778. <tr>
  3779. <td>233</td>
  3780. <td>Double_Below</td>
  3781. <td>Distinct marks subtending two bases</td>
  3782. </tr>
  3783. <tr>
  3784. <td>234</td>
  3785. <td>Double_Above</td>
  3786. <td>Distinct marks extending above two bases</td>
  3787. </tr>
  3788. <tr>
  3789. <td>240</td>
  3790. <td>Iota_Subscript</td>
  3791. <td>Greek iota subscript only</td>
  3792. </tr>
  3793. </table>
  3794. </div>
  3795. <p>Some of the Canonical_Combining_Class values in the table are not currently used
  3796. for any characters but are specified here for completeness. Some
  3797. values do not have long symbolic aliases and are not listed in PropertyValueAliases.txt.
  3798. Do not assume that absence of a long symbolic alias implies
  3799. non-use of a particular Canonical_Combining_Class. See
  3800. <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a> for
  3801. a complete listing of the use of Canonical_Combining_Class values for
  3802. any particular version of the UCD.</p>
  3803. <p>For use in regular expression matching, fixed position classes (ccc=10 through
  3804. ccc=199) which actually occur in the Unicode Character Database for any version are
  3805. given predictable aliases of the form "Ccc10", "Ccc11", and so forth. The complete list of such aliases which
  3806. are actually defined can be found in PropertyValueAliases.txt.</p>
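<p>A lookup from numeric value to symbolic label might be sketched as follows. The dictionary is transcribed from the table above and is an illustration only. Later versions of PropertyValueAliases.txt may define further aliases, and the "Ccc" form is only defined there for values which actually occur. The names <code>CCC_LONG</code> and <code>ccc_label</code> are hypothetical.</p>

```python
import unicodedata

CCC_LONG = {
    0: "Not_Reordered", 1: "Overlay", 7: "Nukta", 8: "Kana_Voicing",
    9: "Virama", 200: "Attached_Below_Left", 202: "Attached_Below",
    214: "Attached_Above", 216: "Attached_Above_Right", 218: "Below_Left",
    220: "Below", 222: "Below_Right", 224: "Left", 226: "Right",
    228: "Above_Left", 230: "Above", 232: "Above_Right",
    233: "Double_Below", 234: "Double_Above", 240: "Iota_Subscript",
}


def ccc_label(ch: str) -> str:
    """Return a symbolic label for the combining class of ch."""
    value = unicodedata.combining(ch)
    if value in CCC_LONG:
        return CCC_LONG[value]
    if 10 <= value <= 199:
        return f"Ccc{value}"      # predictable fixed position alias
    return str(value)             # no long alias in the table above


for ch in ("\u0301", "\u0323", "\u05b0", "\u094d", "\u0345"):
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch), ccc_label(ch))
```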
  3807. <p>The character property invariants regarding Canonical_Combining_Class
  3808. guarantee that values, once assigned, will never change, and
  3809. that all values used will be in the range 0..254. See
  3810. <a href="#Invariants_in_Implementations">Invariants in Implementations</a>.</p>
  3811. <p>Combining marks with ccc=224 (Left) follow their base character in storage,
  3812. as for all combining marks, but are rendered visually on the left
  3813. side of that base character. For all past versions of the UCD and
  3814. continuing with this version of the UCD, only two
  3815. tone marks used in certain notations for Hangul syllables have ccc=224.
  3816. Those marks are actually rendered visually on the left side of
  3817. the preceding <i>grapheme cluster</i>, in the case of Hangul syllables
  3818. resulting from sequences of conjoining jamos.</p>
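<p>The set of characters with a given Canonical_Combining_Class can be listed from the database bundled with Python, as in the sketch below. The result depends on the bundled UCD version. U+302E and U+302F are assumed here to be the two Hangul tone marks in question. The function name <code>code_points_with_ccc</code> is hypothetical.</p>

```python
import sys
import unicodedata


def code_points_with_ccc(value: int) -> list:
    """List all code points whose Canonical_Combining_Class equals value."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.combining(chr(cp)) == value]


left_marks = code_points_with_ccc(224)
print([f"U+{cp:04X} {unicodedata.name(chr(cp))}" for cp in left_marks])
```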
  3819. <p>Those few instances of combining marks with ccc=Left should be
  3820. distinguished from the far more numerous examples of left-side vowel
  3821. signs and vowel letters in Brahmi-derived scripts.
  3822. The Canonical_Combining_Class value is zero (Not_Reordered) for both
  3823. ordinary, left-side (reordrant) vowel signs such as
  3824. U+093F DEVANAGARI VOWEL SIGN I and Thai-style left-side
  3825. (Logical_Order_Exception=Yes) vowel letters such as U+0E40
  3826. THAI CHARACTER SARA E. The "Not_Reordered" of ccc=Not_Reordered
  3827. refers to the behavior of the character in terms of the Canonical
  3828. Ordering Algorithm as part of the definition of Unicode Normalization;
  3829. it does <i>not</i> refer to any issues of visual reordering of glyphs
  3830. involved in display and rendering. See "Canonical Ordering
  3831. Algorithm" in <i>Section 3.11,
  3832. Normalization Forms</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  3833. <h4>5.7.5 <a name="Decompositions_and_Normalization" href="#Decompositions_and_Normalization">Decompositions and Normalization</a></h4>
  3834. <p>Decomposition is specified in <i>Chapter 3, Conformance</i> of
  3835. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].
  3836. That chapter also
  3837. specifies the interaction between decomposition and normalization.</p>
  3838. <p>A number of derived properties related to Unicode normalization are called
  3839. the "Quick_Check" properties. These are defined to enable various optimizations
  3840. for implementations of normalization, as explained in
  3841. <i>Section 9, Detecting Normalization Forms</i>, in Unicode Standard Annex #15, "Unicode Normalization Forms"
  3842. [<a href="../tr41/tr41-21.html#UAX15">UAX15</a>].
  3843. The values for the four Quick_Check properties for all code points are listed in
  3844. DerivedNormalizationProps.txt. The interpretations of the possible property values
  3845. are summarized in <i>Table 16</i>.</p>
  3846. <p class="caption">Table 16. <a name="QC_Values_Table" href="#QC_Values_Table">Quick_Check Property Values</a></p>
  3847. <div align="center">
  3848. <table class="simple">
  3849. <tr>
  3850. <th>Property</th>
  3851. <th>Value</th>
  3852. <th>Description</th>
  3853. </tr>
  3854. <tr>
  3855. <td>NFC_QC, NFKC_QC, NFD_QC, NFKD_QC</td>
  3856. <td>No</td>
  3857. <td>Characters that cannot ever occur in the respective normalization form.</td>
  3858. </tr>
  3859. <tr>
  3860. <td>NFC_QC, NFKC_QC</td>
  3861. <td>Maybe</td>
  3862. <td>Characters that may occur in the respective normalization, depending on the context.</td>
  3863. </tr>
  3864. <tr>
  3865. <td>NFC_QC, NFKC_QC, NFD_QC, NFKD_QC</td>
  3866. <td>Yes</td>
  3867. <td>All other characters. This is the default value for Quick_Check properties.</td>
  3868. </tr>
  3869. </table>
  3870. </div>
  3871. <p>The Quick_Check property values are recommended for exposure in a public library API
  3872. which supports Unicode character properties, because they can be used to optimize
  3873. code that needs to normalize Unicode strings. They enable fast checking of whether
  3874. some input strings are already in the desired normalization form. This may make
  3875. it possible to bypass
  3876. the more time-consuming call to run the complete Unicode Normalization Algorithm
  3877. on the input string.</p>
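<p>The bypass pattern described above might look like the following minimal sketch. It assumes Python 3.8 or later, where <code>unicodedata.is_normalized</code> is available; whether that function consults the Quick_Check data internally is an implementation detail of the interpreter, and the wrapper name <code>to_nfc</code> is hypothetical.</p>

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, running full normalization only when the check fails."""
    if unicodedata.is_normalized("NFC", text):
        return text  # already in NFC: skip the more expensive call
    return unicodedata.normalize("NFC", text)
```

<p>For example, <code>to_nfc("abc")</code> returns its argument unchanged, while a string containing "e" followed by U+0301 COMBINING ACUTE ACCENT is passed to the full algorithm.</p>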
  3878. <p>In contrast, some normalization-related Unicode character properties
  3879. are <i>not</i> recommended for exposure in a public library API. Notably, these include
  3880. <a href="#Decomposition_Mapping">Decomposition_Mapping</a>,
  3881. <a href="#Composition_Exclusion">Composition_Exclusion</a>,
  3882. and the derived <a href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a>.
  3883. These properties are only used internally in a conformant implementation of
  3884. the Unicode Normalization Algorithm. Exposing them in a public API can lead
  3885. to confusion by users of the API. In particular, Decomposition_Mapping is very
  3886. easy to misinterpret as designating the <i>decomposition</i> of a character,
  3887. also known as the character's <i>full decomposition</i>. See Definitions D62 and D64
  3888. in <i>Section 3.7, Decomposition</i> in [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  3889. <h4>5.7.6 <a name="Property_Values_As_Sets" href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a></h4>
  3890. <p>Most properties have a single value associated with each code point.
  3891. However, some properties may instead associate a set of multiple
  3892. different values with each code point. For example, the provisional
  3893. kCantonese property, which lists Cantonese pronunciations
  3894. for unified CJK ideographs, has values which consist of a set of
  3895. zero or more romanized pronunciation strings. Thus, the Unihan
  3896. Database contains an entry:</p>
  3897. <blockquote>
  3898. <pre>
  3899. U+342B kCantonese gun3 hung1 zung1
  3900. </pre>
  3901. </blockquote>
  3902. <p>This line is to be interpreted as associating a set of three string values,
  3903. {"gun3", "hung1", "zung1"} with the kCantonese property for U+342B.</p>
  3904. <p>Similarly, the Script_Extensions property has values which
  3905. consist of a set of one or more Script property values. Thus the
  3906. property file ScriptExtensions.txt in the UCD contains an entry:</p>
  3907. <blockquote>
  3908. <pre>
  3909. 0640 ; Adlm Arab Mand Mani Phlp Syrc # Lm ARABIC TATWEEL
  3910. </pre>
  3911. </blockquote>
  3912. <p>This line is to be interpreted as associating a set of six enumerated
  3913. Script property values, {Adlm, Arab, Mand, Mani, Phlp, Syrc}, with the Script_Extensions
  3914. property for U+0640.</p>
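<p>A reader of such a file could represent the value as a set, as in the sketch below. The function name <code>parse_set_valued_line</code> is hypothetical, and the sketch assumes a single code point in the first field; entries that use a code point range are not handled.</p>

```python
def parse_set_valued_line(line: str) -> tuple[int, frozenset[str]]:
    """Parse one single-code-point entry whose value field is space-separated."""
    data = line.split("#", 1)[0]  # drop the trailing comment, if any
    code_field, value_field = data.split(";", 1)
    code_point = int(code_field.strip(), 16)
    # A frozenset reflects that element order is not significant here.
    return code_point, frozenset(value_field.split())


cp, scripts = parse_set_valued_line(
    "0640 ; Adlm Arab Mand Mani Phlp Syrc # Lm ARABIC TATWEEL"
)
```

<p>An unordered set suits Script_Extensions; a property for which the order of elements is significant would need an ordered container instead.</p>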
  3915. <p>In the case of Script_Extensions, in particular, the set of sets which
  3916. constitute meaningful values of the property is relatively small, and could be explicitly
  3917. evaluated for any particular Unicode version. For example:</p>
  3918. <blockquote>
  3919. <pre>
  3920. {{Adlm, Arab, Mand, Mani, Phlp, Syrc}, {Arab, Copt}, {Arab, Syrc}, {Arab, Thaa}, {Arab, Syrc, Thaa}, {Armn, Geor}, ...}
  3921. </pre>
  3922. </blockquote>
  3923. <p>However, an enumeration of this set of set values is unlikely to be
  3924. of much implementation value, and would be likely to change significantly between
  3925. versions of the standard. In other cases, such as for properties defining pronunciation
  3926. readings for unified CJK ideographs, these sets of sets are completely open-ended, and there
  3927. is no point to attempting to provide explicit enumerations of such sets in the UCD.</p>
  3928. <p>The order of the element values in such sets may or may not be significant.
  3929. For example, the order among the element values for kCantonese and for
  3930. Script_Extensions is not significant. By way of contrast, when the kMandarin
  3931. property shows two values for a code point, the first value is used to
  3932. indicate a preferred pronunciation for zh-Hans (CN) and the second a
  3933. preferred pronunciation for zh-Hant (TW).</p>
  3934. <p>For data file format considerations regarding properties which take
  3935. sets of values, see Section 4.2.8 <a href="#Multiple_Values">Multiple Values for Properties</a>.
  3936. For considerations regarding validation of such
  3937. properties, see Section 5.11.5 <a href="#Validation_of_Multivalued">Validation of Multivalued Properties</a>.
  3938. See also Unicode Technical Standard #18, "Unicode Regular Expressions"
  3939. [<a href="../tr41/tr41-21.html#UTS18">UTS18</a>] for a discussion of how to handle
  3940. such properties when processing regular expressions.</p>
  3941. <h3>5.8 <a name="Property_And_Value_Aliases" href="#Property_And_Value_Aliases">Property and Property Value Aliases</a></h3>
  3942. <p>Both Unicode character properties themselves and their values are
  3943. given symbolic aliases. The formal lists of aliases are provided so that
  3944. well-defined symbolic values are available for XML formats of the UCD
  3945. data, for regular expression property tests, and for other
  3946. programmatic textual descriptions of Unicode data.
  3947. The aliases for properties are defined in
  3948. PropertyAliases.txt. The aliases for property values are defined in
  3949. PropertyValueAliases.txt.</p>
  3950. <p class="caption">Table 17. <a name="Alias_Files_Table" href="#Alias_Files_Table">Alias Files in the UCD</a></p>
  3951. <div align="center">
  3952. <table class="simple">
  3953. <tr>
  3954. <th>File Name</th>
  3955. <th>Status</th>
  3956. <th>Description</th>
  3957. </tr>
  3958. <tr>
  3959. <td><a name="PropertyAliases.txt" href="#PropertyAliases.txt">PropertyAliases.txt</a></td>
  3960. <td>N</td>
  3961. <td>Names and abbreviations for properties</td>
  3962. </tr>
  3963. <tr>
  3964. <td><a name="PropertyValueAliases.txt" href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a></td>
  3965. <td>N</td>
  3966. <td>Names and abbreviations for property values</td>
  3967. </tr>
  3968. </table>
  3969. </div>
  3970. <p>Aliases are defined as ASCII-compatible identifiers, using only uppercase or
  3971. lowercase A-Z, digits, and underscore "_". Case is not significant
  3972. when comparing aliases, but the preferred form used in the data files
  3973. for longer aliases is to titlecase them.</p>
  3974. <p>Aliases may be translated in appropriate environments, and additional
  3975. aliases may be useful in certain contexts. There is no requirement that
  3976. only the aliases defined in the alias files of the UCD be used when
  3977. referring to Unicode character properties or their values; however, their
  3978. use is recommended for interoperability in data formats or in
  3979. programmatic contexts.</p>
  3980. <p>Aliases may be provided
  3981. for provisional properties. There are stability guarantees for property aliases and property
  3982. value aliases, but no stability guarantees for provisional properties or other
  3983. provisional data files; consequently, there can also be
  3984. no stability guarantee for property aliases or property value aliases associated with provisional
  3985. properties.</p>
  3986. <h4>5.8.1 <a name="Property_Aliases" href="#Property_Aliases">Property Aliases</a></h4>
  3987. <p>In PropertyAliases.txt, the first field specifies an abbreviated
  3988. symbolic name for the property, and the second field specifies the
  3989. long symbolic name for the property. These are the preferred aliases.
  3990. Additional aliases for a few properties are specified in the third
  3991. or subsequent fields.</p>
  3992. <p>Aliases for normative and informative
  3993. properties defined in the Unihan data files are included in PropertyAliases.txt,
  3994. beginning with Version 5.2.</p>
  3995. <p>The long symbolic name alias is self-descriptive, and is
  3996. treated as the official name of
  3997. a Unicode character property. For clarity it is used whenever possible
  3998. when referring to that
  3999. property in this annex and elsewhere in the Unicode Standard.
  4000. For example: "The Line_Break property is discussed in Unicode Standard Annex #14, "Unicode Line
  4001. Breaking Algorithm" [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>]."</p>
  4002. <p>The abbreviated symbolic name alias is short and less mnemonic,
  4003. but is useful for expressions such as "lb=BA" in data or in other
  4004. contexts where the meaning is clear.</p>
  4005. <p>The property aliases specified in PropertyAliases.txt constitute
  4006. a unique namespace. When using these symbolic values, no
  4007. alias for one property will match an alias for another property.</p>
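<p>Because property aliases form a unique namespace, a single lookup table can map any alias to its property. The sketch below is one possible way to build such a table; the function name <code>build_property_lookup</code> is hypothetical, the sample lines merely follow the field layout described above, and comparison is by simple lowercasing rather than full loose matching.</p>

```python
def build_property_lookup(lines):
    """Map every alias (lowercased) to the long symbolic name in the second field."""
    lookup = {}
    for raw in lines:
        entry = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not entry:
            continue
        fields = [field.strip() for field in entry.split(";")]
        long_name = fields[1]
        for alias in fields:
            previous = lookup.setdefault(alias.lower(), long_name)
            if previous != long_name:
                # A collision would contradict the unique-namespace property.
                raise ValueError(f"alias {alias!r} names two properties")
    return lookup


sample = [
    "lb                       ; Line_Break",
    "sc                       ; Script",
    "ccc                      ; Canonical_Combining_Class",
]
lookup = build_property_lookup(sample)
```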
  4008. <h4>5.8.2 <a name="Property_Value_Aliases" href="#Property_Value_Aliases">Property Value Aliases</a></h4>
  4009. <p>In PropertyValueAliases.txt, the first field contains the
  4010. abbreviated alias for a Unicode property, the second field specifies
  4011. an abbreviated symbolic name for a value of that property, and
  4012. the third field specifies the
  4013. long symbolic name for that value of that property. These are the
  4014. preferred aliases.
  4015. Additional aliases for some property values may be specified in the fourth
  4016. or subsequent fields. For example, for binary properties, the
  4017. abbreviated alias for the True value is "Y", and the long alias
  4018. is "Yes", but each entry also specifies "T" and "True" as
  4019. additional aliases for that value, as shown in <i>Table 18</i>.</p>
  4020. <p class="caption">Table 18. <a name="Binary_Values_Table" href="#Binary_Values_Table">Binary Property Value Aliases</a></p>
  4021. <div align="center">
  4022. <table class="simple">
  4023. <tr>
  4024. <th>Long</th>
  4025. <th>Abbreviated</th>
  4026. <th>Other Aliases</th>
  4027. </tr>
  4028. <tr>
  4029. <td style="text-align:center">Yes</td>
  4030. <td style="text-align:center">Y</td>
  4031. <td style="text-align:center">True, T</td>
  4032. </tr>
  4033. <tr>
  4034. <td style="text-align:center">No</td>
  4035. <td style="text-align:center">N</td>
  4036. <td style="text-align:center">False, F</td>
  4037. </tr>
  4038. </table>
  4039. </div>
  4040. <p>Not every property value has an associated alias. Property value
  4041. aliases are typically supplied for catalog and enumeration
  4042. properties, which have well-defined, enumerated values. It does not
  4043. make sense to specify property value aliases, for example, for
  4044. the Numeric_Value property, whose value could be any number, or
  4045. for a string property such as Simple_Lowercase_Mapping, whose values
  4046. are mappings from one code point to another.</p>
  4047. <p>The Canonical_Combining_Class property requires special handling
  4048. in PropertyValueAliases.txt. The values of this property are numeric,
  4049. but they comprise a closed, enumerated set of values. The more
  4050. important of those values are given symbolic name aliases.
  4051. In PropertyValueAliases.txt, the second field provides the numeric
  4052. value, while the third field contains the abbreviated symbolic
  4053. name alias and the fourth field contains the long symbolic
  4054. name alias for that numeric value. For example:</p>
  4055. <blockquote>
  4056. <pre>
  4057. ccc; 230; A ; Above
  4058. ccc; 232; AR ; Above_Right
  4059. </pre>
  4060. </blockquote>
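<p>A parser therefore needs one special case for ccc entries. The following is a minimal sketch under that assumption; the type <code>ValueAlias</code> and the function <code>parse_value_alias</code> are hypothetical names, and comment handling is simplified.</p>

```python
from typing import NamedTuple, Optional


class ValueAlias(NamedTuple):
    prop: str               # abbreviated property alias (first field)
    short: str              # abbreviated value alias
    long: str               # long value alias
    numeric: Optional[int]  # only set for Canonical_Combining_Class
    extra: tuple            # any additional aliases


def parse_value_alias(line: str) -> ValueAlias:
    """Parse one entry, shifting fields by one for ccc entries."""
    fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
    prop = fields[0]
    if prop == "ccc":
        # Second field is the numeric value; short and long aliases follow it.
        return ValueAlias(prop, fields[2], fields[3], int(fields[1]), tuple(fields[4:]))
    return ValueAlias(prop, fields[1], fields[2], None, tuple(fields[3:]))
```

<p>For instance, <code>parse_value_alias("ccc; 230; A ; Above")</code> yields the numeric value 230 with the aliases "A" and "Above".</p>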
  4061. <p>Taken by themselves, property value aliases do not constitute
  4062. a unique namespace. The abbreviated aliases, in particular,
  4063. are often re-used as aliases for values for different properties.
  4064. All of the binary property value aliases, for example, make
  4065. use of the same "Y", "Yes", "T", "True" symbols. Property value
  4066. aliases may also overlap the symbols used for property aliases.
  4067. For example, "Sc" is the abbreviated alias for the
  4068. "Currency_Symbol" value of the General_Category property, but
  4069. it is also the abbreviated alias for the Script property.
  4070. However, the aliases for values for any single property are
  4071. always unique within the context of that property. That
  4072. means that expressions that combine a property alias and
  4073. a property value alias, such as "lb=BA" or "gc=Sc" <i>always</i>
  4074. refer unambiguously just to one value of one given property,
  4075. and will not match any other value of any other property.</p>
  4076. <p>Prior to Version 6.1.0, the property value alias entries for three properties,
  4077. Age, Block, and Joining_Group, made use of a special metavalue
  4078. "n/a" in the field for the abbreviated alias. This should
  4079. be understood as meaning that no abbreviated alias was
  4080. defined for that value for that property, rather than as
  4081. an alias per se. Starting with Version 6.1.0, all property values for those
  4082. three properties have abbreviated aliases, so there is no current use of the "n/a" metavalue.</p>
  4083. <p>In a few cases, because of longstanding legacy practice
  4084. in referring to values of a property by short identifiers,
  4085. the abbreviated alias and the long alias are the same. This
  4086. can be seen, for example, in some property value aliases
  4087. for the Line_Break property and the Grapheme_Cluster_Break
  4088. property.</p>
  4089. <p>The property <a href="#Script_Extensions">Script_Extensions</a>
  4090. consists of enumerated sets of Script property values. The set of those sets is potentially
  4091. open-ended, and no property value aliases are defined for them.</p>
  4092. <h3>5.9 <a name="Matching_Rules" href="#Matching_Rules">Matching Rules</a></h3>
  4093. <p>When matching Unicode character property names
  4094. and values, it is strongly recommended that all
  4095. <a href="#Property_And_Value_Aliases">Property and Property Value Aliases</a>
  4096. be recognized. For best results in matching, rather than using
  4097. exact binary comparisons, the following loose matching rules
  4098. should be observed.</p>
  4099. <h4>5.9.1 <a name="Matching_Numeric" href="#Matching_Numeric">Matching Numeric Property Values</a></h4>
  4100. <p>For all numeric properties, and for properties such as Unicode_Radical_Stroke
  4101. which are constructed from combinations
  4102. of numeric values, use loose matching rule UAX44-LM1 when comparing property values.</p>
  4103. <p><i><b><a name="UAX44-LM1" href="#UAX44-LM1">UAX44-LM1</a>.</b></i> Apply numeric equivalences.</p>
  4104. <ul>
  4105. <li>&quot;01.00&quot; is equivalent to &quot;1&quot;.</li>
  4106. <li>&quot;1.666667&quot; in the UCD is a repeating fraction, and
  4107. equivalent to "10/6" or "5/3".</li>
  4108. </ul>
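<p>One way to apply such numeric equivalences is to reduce each value to an exact fraction before comparing. The sketch below uses Python's <code>fractions</code> module; the function name <code>numeric_key</code> and the denominator bound of 1000, used to recover a repeating fraction from its rounded decimal form, are assumptions of this example and not part of the rule.</p>

```python
from fractions import Fraction


def numeric_key(text: str, max_denominator: int = 1000) -> Fraction:
    """Reduce a numeric property value string to a comparable fraction.

    Rounded decimals are snapped to the nearest fraction whose denominator
    does not exceed max_denominator (an assumed bound for this sketch).
    """
    return Fraction(text.strip()).limit_denominator(max_denominator)
```

<p>With this key, "01.00" compares equal to "1", and "1.666667" compares equal to both "10/6" and "5/3".</p>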
  4109. <h4>5.9.2 <a name="Matching_Names" href="#Matching_Names">Matching Character Names</a></h4>
  4110. <p>Unicode character names constitute a special case. Formally, they are values
  4111. of the Name property. While each Unicode character name for an assigned character
  4112. is guaranteed to be unique, names are assigned in such a way that
  4113. the presence or absence of spaces cannot be used to distinguish them.
  4114. Furthermore, implementations sometimes create identifiers from Unicode
  4115. character names by inserting underscores for spaces. For best results
  4116. in comparing Unicode character names, use loose matching rule UAX44-LM2.</p>
  4117. <p><i><b><a name="UAX44-LM2" href="#UAX44-LM2">UAX44-LM2</a>.</b></i> Ignore case, whitespace, underscore (&#39;_&#39;), and all medial hyphens except the hyphen in
  4118. U+1180 HANGUL JUNGSEONG O-E.</p>
  4119. <ul>
  4120. <li>&quot;zero-width space&quot; is equivalent to &quot;ZERO WIDTH SPACE&quot; or &quot;zerowidthspace&quot;</li>
  4121. <li>&quot;character -a&quot; is <i>not</i> equivalent to &quot;character a&quot;</li>
  4122. </ul>
  4123. <p>In this rule "medial hyphen" is to be construed as a hyphen
  4124. occurring immediately between two letters in the normative Unicode character
  4125. name, as published in the Unicode names list, and not as any hyphen that may
  4126. transiently occur medially as a result of removing whitespace before removing hyphens in
  4127. a particular implementation of matching. Thus the hyphen in the name
  4128. U+10089 LINEAR B IDEOGRAM B107M HE-GOAT is medial, and should be ignored
  4129. in loose matching, but the hyphen in the name U+0F39 TIBETAN MARK TSA -PHRU is
  4130. <i>not</i> medial, and should not be ignored in loose matching.</p>
  4131. <p>An implementation of this loose matching rule can obtain
  4132. the correct results when comparing two strings by doing the following three
  4133. operations, in order:</p>
  4134. <ol>
  4135. <li>remove all medial hyphens (except the medial hyphen in the name for U+1180)</li>
  4136. <li>remove all whitespace and underscore characters</li>
  4137. <li>apply toLowercase() to both strings</li>
  4138. </ol>
  4139. <p>After applying these three operations, if the two strings
  4140. compare binary equal, then they are considered to match.</p>
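<p>The three operations can be sketched as a function that computes a comparison key. The name <code>loose_name_key</code> is hypothetical. As an assumption of this sketch, a hyphen counts as medial when the characters immediately before and after it are both alphanumeric, which also covers hyphens adjacent to digits; the U+1180 exception is handled by comparing against that one name directly.</p>

```python
def loose_name_key(name: str) -> str:
    """Compute a comparison key for a character name, in the spirit of UAX44-LM2."""
    # Exception: keep the hyphen in HANGUL JUNGSEONG O-E (U+1180).
    squeezed = "".join(c for c in name if not c.isspace() and c != "_").lower()
    if squeezed == "hanguljungseongo-e":
        return squeezed

    # Step 1: remove medial hyphens, judged on the original spacing.
    kept = []
    for i, ch in enumerate(name):
        before = name[i - 1:i] if i else ""
        after = name[i + 1:i + 2]
        is_medial = ch == "-" and before.isalnum() and after.isalnum()
        if not is_medial:
            kept.append(ch)

    # Step 2: remove whitespace and underscores. Step 3: lowercase.
    return "".join(c for c in kept if not c.isspace() and c != "_").lower()
```

<p>Two names are then considered to match when their keys compare equal. Hyphens are examined before whitespace is removed, so a hyphen preceded by a space, as in TIBETAN MARK TSA -PHRU, is never mistaken for a medial one.</p>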
  4141. <p>This is a logical statement of how the rule works. If programmed
  4142. carefully, an implementation of the matching rule can transform the strings in
  4143. a single pass. It is also possible to compare two name strings for loose matching
  4144. while transforming each string incrementally.</p>
  4145. <p>Loose matching rule UAX44-LM2 is also appropriate for matching
  4146. character name aliases and the names of named character sequences, which share the
  4147. namespace (and matching behavior) of Unicode character names. See <i>Section 4.8, Name</i> in
  4148. [<a href="../tr41/tr41-21.html#Unicode">Unicode</a>].</p>
  4149. <p>Implementations of name matching should use extreme care when matching
  4150. non-standard, alternative names for particular characters. The Name Uniqueness Policy
  4151. in the Unicode Consortium Stability
  4152. Policies [<a href="../tr41/tr41-21.html#Stability">Stability</a>] guarantees that
  4153. the Unicode Standard will never add a character whose name would match an existing
  4154. encoded character, according to matching rule UAX44-LM2. However, any <i>other</i>
  4155. name for a character might be used in the future.</p>
  4156. <p>The following is a concrete example of the kind of trouble that can occur.
  4157. Prior to Unicode 6.0 some implementations of regex allowed matching of the name "BELL" for
  4158. the control code U+0007. When Unicode 6.0 added a <i>different</i> encoded character,
  4159. U+1F514 BELL for emoji symbols, those regex implementations broke.</p>
  4160. <p>As of Version 6.1 of the Unicode Standard, the most commonly occurring
  4161. alternative names for control codes, as well as many commonly used abbreviations for
  4162. Unicode format characters, have been added as character name aliases. This automatically
  4163. excludes all such alternative names and abbreviations from the potential pool for
  4164. future Unicode character names, because name uniqueness is defined over the namespace
  4165. which includes both character names and character name aliases. That exclusion should
  4166. reduce the potential for surprises similar to the "BELL" case, where implementers
  4167. assume that a name for a control code is already well-defined.</p>
  4168. <h4>5.9.3 <a name="Matching_Symbolic" href="#Matching_Symbolic">Matching Symbolic Values</a></h4>
  4169. <p>Property aliases and property value aliases are symbolic values. When
  4170. comparing them, use loose matching rule UAX44-LM3.</p>
  4171. <p><i><b><a name="UAX44-LM3" href="#UAX44-LM3">UAX44-LM3</a>.</b></i> Ignore case, whitespace, underscore (&#39;_&#39;),
  4172. hyphens, and any initial prefix string "is".</p>
  4173. <ul>
  4174. <li>&quot;linebreak&quot; is equivalent to &quot;Line_Break&quot; or &quot;Line-break&quot;</li>
  4175. <li>"lb=BA" is equivalent to "lb=ba" or "LB=BA"</li>
  4176. <li>"Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"</li>
  4177. </ul>
  4178. <p>Loose matching is generally appropriate for the property values of
  4179. Catalog, Enumeration, and Binary properties, which have symbolic aliases
  4180. defined for their values.
  4181. Loose matching should not be done for the property values of String properties,
  4182. which do not have symbolic aliases defined for their values; exact
  4183. matching for String property values is important, as
  4184. case distinctions or other distinctions in those values may be significant.</p>
  4185. <p>For loose matching of symbolic values, an initial prefix string "is" is
  4186. ignored. The reason for this is that APIs returning property values are often
  4187. named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to
  4188. a property value. Ignoring any initial "is" on a symbolic value during loose
  4189. matching is likely to produce the best results in application areas such as
  4190. regex. Removal of an initial "is" string for a loose matching comparison only
  4191. needs to be done once for a symbolic value, and need not be tested recursively.
  4192. There are no property aliases or property value aliases of the form
  4193. "isisisisistooconvoluted" defined just to test implementation edge cases.</p>
  4194. <p>Existing and future property aliases and property value
  4195. aliases are guaranteed to be unique within their relevant namespaces, even
  4196. if an initial prefix string "is" is ignored. The existing cases of note
  4197. for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated)
  4198. and lb=IS. The Decomposition_Type value alias does not cause any problem,
  4199. because there is no contrasting value alias dt=o (Decomposition_Type=olated).
  4200. For lb=IS, note that the "IS" is the <i>entire</i> property value alias, and
  4201. is not a prefix. There is no null value for the Line_Break property for it
  4202. to contrast with, but implementations of loose matching should be careful
  4203. of this edge case, so that "lb=IS" is not misinterpreted as matching a null
  4204. value.</p>
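<p>A minimal sketch of loose matching for symbolic values, including the "lb=IS" edge case, might look as follows. The function names <code>loose_symbol_key</code> and <code>loose_expression_key</code> are hypothetical, and the "property=value" expression syntax is only the informal notation used in the examples above.</p>

```python
def loose_symbol_key(symbol: str) -> str:
    """Compute a comparison key for a property alias or property value alias."""
    key = "".join(
        ch for ch in symbol.lower() if not ch.isspace() and ch not in "_-"
    )
    # Strip a single initial "is", but never reduce the whole symbol to a
    # null value: "IS" by itself is the entire alias, not a prefix.
    if key.startswith("is") and key != "is":
        key = key[2:]
    return key


def loose_expression_key(expression: str) -> tuple[str, str]:
    """Split 'property=value' and compute a key for each side."""
    prop, _, value = expression.partition("=")
    return loose_symbol_key(prop), loose_symbol_key(value)
```

<p>Under this sketch "Script=isGreek" and "Script=Is_Greek" both produce the same key pair as "Script=Greek", while "lb=IS" keeps the value key "is". The alias tables used for lookup would need to be keyed the same way, so that dt=Iso is stored under the key "o".</p>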
  4205. <p>Implementations sometimes use other syntactic constructs
  4206. that interact with loose matching. For example, the property matching
  4207. expression \p{L} may be defaulted to refer to the Unicode General_Category
  4208. property: \p{General_Category=L}. For more information about
  4209. the use of property values in regular expressions and other environments,
  4210. see <i>Section 1.2, Properties</i>, in Unicode Technical Standard #18,
  4211. "Unicode Regular Expressions" [<a href="../tr41/tr41-21.html#UTS18">UTS18</a>].</p>
  4212. <h3>5.10 <a name="Invariants" href="#Invariants">Invariants</a></h3>
  4213. <p>Property values in the UCD may be subject to correction
  4214. in subsequent versions of the standard, as errors are found. Furthermore, any
  4215. new version of the Unicode Standard may introduce new property values for
  4216. a given property, except where the set of allowable values is fixed
  4217. by the property type (such as for binary properties), or where the
  4218. set of allowable values is subject to a provision of the Unicode
  4219. Character Encoding Stability Policy [<a href="../tr41/tr41-21.html#Stability">Stability</a>].
  4220. Finally, a new version may also
  4221. introduce new properties or new data files in the UCD.</p>
  4222. <p>Implementers of the UCD need to be aware of
  4223. such changes when updating to new versions. However, some property values
  4224. and some aspects of the file formats are considered
  4225. invariant. This section documents such invariants.</p>
  4226. <h4>5.10.1 <a name="Property_Invariants" href="#Property_Invariants">Character Property Invariants</a></h4>
  4227. <p>All formally guaranteed invariants for properties or property values
  4228. are described in
  4229. the Unicode Character Encoding Stability Policy
  4230. [<a href="../tr41/tr41-21.html#Stability">Stability</a>].
  4231. That policy and the list of invariants it enumerates are
  4232. maintained outside the context of the Unicode Standard per se.
  4233. They are not part of the standard, but rather are constraints
  4234. on what can and cannot change in the standard between versions,
  4235. and on what decisions the Unicode Technical Committee can and
  4236. cannot take regarding the standard.</p>
  4237. <p>In addition to the formally guaranteed invariants described
  4238. in the Unicode Character Encoding Stability Policy, this section
  4239. notes a few additional points regarding character property
  4240. invariants in the UCD.</p>
  4241. <p>Some character properties are simply considered <i>immutable</i>: once
  4242. assigned, they are never changed. For example, a character's name
  4243. is immutable, because of its importance in exact identification
  4244. of the character. The Canonical_Combining_Class and
  4245. Decomposition_Mapping of a character are immutable, because of their
  4246. importance to the stability of the Unicode Normalization Algorithm
  4247. [<a href="../tr41/tr41-21.html#UAX15">UAX15</a>].</p>
  4248. <p>The list of immutable character properties is shown in
  4249. <i>Table 19</i>.</p>
  4250. <p class="caption">Table 19. <a name="Immutable_Properties_Table" href="#Immutable_Properties_Table">Immutable Properties</a></p>
  4251. <div align="center">
  4252. <table class="simple">
  4253. <tr>
  4254. <th>Property Name</th>
  4255. <th>Abbr Name</th>
  4256. <th>Default Value</th>
  4257. <th>Assignable to New?</th>
  4258. </tr>
  4259. <tr>
  4260. <td>Age</td>
  4261. <td>Age</td>
  4262. <td>Unassigned</td>
  4263. <td>Yes</td>
  4264. </tr>
  4265. <tr>
  4266. <td>Name</td>
  4267. <td>na</td>
  4268. <td>null string</td>
  4269. <td>Yes</td>
  4270. </tr>
  4271. <tr>
  4272. <td>Name_Alias</td>
  4273. <td>Name_Alias</td>
  4274. <td>null string</td>
  4275. <td>Yes (see note)</td>
  4276. </tr>
  4277. <tr>
  4278. <td>Jamo_Short_Name</td>
  4279. <td>jsn</td>
  4280. <td>null string</td>
  4281. <td>No</td>
  4282. </tr>
  4283. <tr>
  4284. <td>Canonical_Combining_Class</td>
  4285. <td>ccc</td>
  4286. <td>0</td>
  4287. <td>Yes</td>
  4288. </tr>
  4289. <tr>
  4290. <td>Decomposition_Mapping</td>
  4291. <td>dm</td>
  4292. <td>&lt;code point&gt;</td>
  4293. <td>Yes</td>
  4294. </tr>
  4295. <tr>
  4296. <td>Pattern_Syntax</td>
  4297. <td>Pat_Syn</td>
  4298. <td>No</td>
  4299. <td>No</td>
  4300. </tr>
  4301. <tr>
  4302. <td>Pattern_White_Space</td>
  4303. <td>Pat_WS</td>
  4304. <td>No</td>
  4305. <td>No</td>
  4306. </tr>
  4307. <tr>
  4308. <td>Noncharacter_Code_Point</td>
  4309. <td>NChar</td>
  4310. <td>No</td>
  4311. <td>No</td>
  4312. </tr>
  4313. </table>
  4314. </div>
  4315. <p>If a property has "Yes" in the "Assignable to New?" column
  4316. in <i>Table 19</i>, that means that the property value is immutable once
  4317. it is initially assigned to a newly encoded character. The value for a
  4318. reserved code point takes the default value, as shown
  4319. in the third column of the table, but <i>may change</i> from the default value
  4320. once the character is encoded. On the other hand, if a property has "No"
  4321. in the "Assignable to New?" column, that means that it is <i>absolutely</i>
  4322. immutable: all code points, including reserved code points, have a specific
  4323. property value assigned, and that value does not change if a new character
  4324. is encoded at a particular reserved code point in a future version of the
  4325. standard.</p>
  4326. <p>The Name_Alias property is unusual, in that there can be more
  4327. than one formal name alias assigned to a given encoded character. The default
  4328. value for Name_Alias is the null string, but once any Name_Alias is assigned
  4329. to an encoded character, that value is immutable. If more than one formal
  4330. name alias is assigned to the same encoded character, each of those values is
  4331. immutable.</p>
  4332. <p>A set of binary character properties associated with identifiers have
  4333. a different kind of immutability, which can be described as <i>locked to Yes</i>.
  4334. This results from the way these properties are used in the specification of identifiers.
  4335. Unicode identifiers have the characteristic of stability between versions, so that
  4336. once a string is specified as belonging to a particular class of identifier, it must <i>stay</i>
  4337. in that class for future versions of the standard. Because of that requirement
  4338. for identifier stability, there are associated constraints on
  4339. how the related character properties can change. In particular, the identifier-related properties
  4340. listed in <i>Table 19a</i> may have their values for any particular assigned character
  4341. change from No to Yes between versions of the standard, but once a character has the
  4342. value Yes, that value is locked in, and cannot ever be changed back to No.</p>
  4343. <p class="caption">Table 19a. <a name="Yes_Locked_Properties_Table" href="#Yes_Locked_Properties_Table">Yes-Locked Properties</a></p>
  4344. <div align="center">
  4345. <table class="simple">
  4346. <tr>
  4347. <th>Property Name</th>
  4348. <th>Abbr Name</th>
  4349. <th>Default Value</th>
  4350. </tr>
  4351. <tr>
  4352. <td>ID_Start</td>
  4353. <td>IDS</td>
  4354. <td>No</td>
  4355. </tr>
  4356. <tr>
  4357. <td>ID_Continue</td>
  4358. <td>IDC</td>
  4359. <td>No</td>
  4360. </tr>
  4361. <tr>
  4362. <td>XID_Start</td>
  4363. <td>XIDS</td>
  4364. <td>No</td>
  4365. </tr>
  4366. <tr>
  4367. <td>XID_Continue</td>
  4368. <td>XIDC</td>
  4369. <td>No</td>
  4370. </tr>
  4371. </table>
  4372. </div>
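<p>As an illustration only, the locked-to-Yes constraint can be checked by comparing the
sets of code points that have the value Yes in two versions. The function name and the sample
sets below are hypothetical; a real check would read the sets from the derived data files
of each version.</p>

```python
def yes_locked_violations(old_yes, new_yes):
    """Return code points that had the value Yes in the older version
    but not in the newer one, sorted ascending."""
    return sorted(set(old_yes) - set(new_yes))


# Hypothetical ID_Start sets for two versions (illustrative data only).
ids_old = {0x0041, 0x0061}
ids_new = {0x0041, 0x0061, 0x00AA}   # a No -> Yes change is allowed
print(yes_locked_violations(ids_old, ids_new))  # prints []
```

<p>An empty result means that no character moved from Yes back to No.</p>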
  4373. <p>In some cases, a property is not immutable, but the list
  4374. of possible values that it can have is considered
  4375. invariant. For example, while at least some General_Category
  4376. values are subject to change and correction, the enumerated set
  4377. of possible values that the General_Category property can have
  4378. is fixed and cannot be added to in the future. However, not all Enumeration
  4379. properties used by Unicode algorithms have immutable lists of
  4380. property values. For example, the enumerated lists of values
  4381. associated with the Line_Break and the Word_Break properties have
  4382. changed in the past, and may be changed again in future versions
  4383. of the standard.</p>
  4384. <p>All characters other than
  4385. those of General_Category M* are guaranteed to have Canonical_Combining_Class=0.
  4386. Currently it is also true that all characters
  4387. other than those of General_Category Mn have Canonical_Combining_Class=0.
  4388. However, the more constrained statement is not a guaranteed invariant;
  4389. it is possible that some new character of
  4390. General_Category Me or Mc could be given a non-zero value for
  4391. Canonical_Combining_Class in the future.</p>
  4392. <p>In Unicode 4.0 and thereafter, the General_Category value
  4393. <i>Decimal_Number</i> (Nd), and
  4394. the Numeric_Type value <i>Decimal</i> (de) are defined to be co-extensive;
  4395. that is, the set of
  4396. characters having General_Category=Nd will always be the same as the
set of characters having Numeric_Type=de.</p>
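<p>A rough way to observe this invariant is sketched below with Python's standard
unicodedata module. The sketch assumes that unicodedata.decimal returns a value exactly
for the characters with Numeric_Type=Decimal in the UCD version bundled with the interpreter;
that correspondence is an assumption of this example, not a statement of the standard.</p>

```python
import sys
import unicodedata


def nd_de_mismatches():
    """Code points where General_Category=Nd and 'has a decimal digit value' disagree."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        # Assumption: a decimal digit value is present iff Numeric_Type=Decimal.
        is_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != is_decimal:
            bad.append(cp)
    return bad
```

<p>If the two sets are co-extensive, the returned list is empty.</p>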
  4398. <h4>5.10.2 <a name="File_Invariants" href="#File_Invariants">UCD File Format Invariants</a></h4>
  4399. <p>There are also some constraints on allowable change in the
  4400. file formats for UCD files. In general, the
  4401. <a href="#Format_Conventions">file format conventions</a> are
  4402. changed as little as possible, to minimize the impact on
  4403. implementations which parse the machine-readable data files.
  4404. However, some of the constraints on allowable file format
  4405. change go beyond conservatism in format and instead have
  4406. the status of invariants. These guarantees apply in particular
  4407. to UnicodeData.txt, the very first data file associated with
  4408. the UCD.</p>
<p>The number and order of the fields in UnicodeData.txt are fixed.
  4410. Any additional information about character properties to be added
  4411. to the UCD in the future will
  4412. appear in separate data files, rather than being added as an
  4413. additional field to UnicodeData.txt or by reinterpretation
  4414. of any of the existing fields.</p>
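<p>Because the field count and order are fixed, a parser can reasonably treat any other
field count as an error. The following is a minimal sketch; the record type and its field
labels are names chosen for this example, covering the 15 semicolon-delimited fields
(numbered 0 through 14) of UnicodeData.txt.</p>

```python
from collections import namedtuple

# Field labels are this sketch's own names for the 15 fixed fields (0..14).
UnicodeDataRecord = namedtuple("UnicodeDataRecord", [
    "code", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal", "digit", "numeric",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
])


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt line; reject any unexpected field count."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(UnicodeDataRecord._fields):
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(*fields)


rec = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(rec.general_category, rec.simple_lowercase)  # prints: Lu 0061
```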
  4415. <h4>5.10.3 <a name="Invariants_in_Implementations" href="#Invariants_in_Implementations">Invariants in Implementations</a></h4>
  4416. <p>Applications may wish to take the various character property
  4417. and file format
  4418. invariants into account when choosing how to implement character properties.</p>
  4419. <p>The Canonical_Combining_Class offers a good example. The
  4420. character property invariants regarding Canonical_Combining_Class
  4421. guarantee that values, once assigned, will never change, and
  4422. that all values used will be in the range 0..254. This means
  4423. that the Canonical_Combining_Class can be safely implemented
  4424. in an unsigned byte and that any value stored in a table for
  4425. an existing character will not need to be updated dynamically
  4426. for a later version.</p>
  4427. <p>In practice, for Canonical_Combining_Class far fewer
  4428. than 256 values are used. Unicode 3.0 used 53 values;
  4429. Unicode 3.1 through Unicode 4.1 used 54 values; and Unicode 5.0
  4430. through Unicode 9.0 used 55 values. New, non-zero
  4431. Canonical_Combining_Class values are seldom added to the standard.
  4432. (For details about this history, see
  4433. <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a>.)
  4434. Implementations may take advantage of this fact for compression,
  4435. because only the ordering of
  4436. the non-zero values, and not their absolute values, matters for
  4437. the Canonical Ordering Algorithm. In principle, it would be
  4438. possible for up to 255 values to be used in the future, but
  4439. the chances of the actual number of values exceeding 128
  4440. are remote at this point. There are implementation advantages
  4441. in restricting the number of internal class values to
  4442. 128&#x2014;for example, the ability to use signed bytes without
  4443. implicit widening to ints in Java.</p>
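<p>The compression idea can be sketched as a dense renumbering that keeps the relative
order of the class values. The helper name is hypothetical, and the input list is only a
small sample of Canonical_Combining_Class values, not the full set in use.</p>

```python
def build_ccc_ranks(ccc_values):
    """Map each distinct ccc value to a dense rank, preserving numeric order."""
    return {ccc: rank for rank, ccc in enumerate(sorted(set(ccc_values)))}


# A few ccc values that occur in the UCD (not the full list).
ranks = build_ccc_ranks([0, 1, 7, 9, 220, 230, 240])
print(ranks[220] < ranks[230])  # prints True: relative order is preserved
print(max(ranks.values()))      # prints 6: fits easily in a signed byte
```

<p>Because 0 is the smallest value, it keeps rank 0, so the test for "not reordered"
characters is unchanged by the remapping.</p>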
  4444. <h3>5.11 <a name="Validation" href="#Validation">Validation</a></h3>
  4445. <p>The Unicode character
  4446. property values in the UCD files can be validated by means of regular
  4447. expressions. Such validation can also be useful in testing of
  4448. implementations that return property values. The method of validation
  4449. depends on the type of property, as described below.
  4450. These expressions use Perl syntax, but may
  4451. of course be converted to other formal conventions for use
  4452. with other regular expression engines.</p>
  4453. <p>The regular expressions which are appropriate for validation
  4454. of particular properties may change in each subsequent version of the UCD.
  4455. However, because of stability guarantees for character property aliases, these
  4456. regular expressions for one version of
  4457. the Unicode Standard will match valid values for previous versions
  4458. of the standard.</p>
  4459. <h4>5.11.1 <a name="Validation_of_Enumerated" href="#Validation_of_Enumerated">Enumerated and Binary Properties</a></h4>
  4460. <p>Enumerated and binary character properties can be validated by
  4461. generating a regular expression using the PropertyValueAliases.txt file. Because
  4462. enumerated properties have a defined list of possible values, the validating
  4463. regular expression simply ORs together all of the possible values. Binary properties
  4464. are a special case of enumerated property, with a predefined very short
  4465. list of possible values.</p>
  4466. <p>For example, to validate the East_Asian_Width property in
  4467. the UCD, or to test an implementation that returns the East_Asian_Width property,
  4468. parse the following relevant lines from PropertyValueAliases.txt and produce a
regular expression that joins each of the short and long property value
aliases with the alternation operator.</p>
  4471. <blockquote>
  4472. <pre>
  4473. # East_Asian_Width (ea)
  4474. ea ; A ; Ambiguous
  4475. ea ; F ; Fullwidth
  4476. ea ; H ; Halfwidth
  4477. ea ; N ; Neutral
  4478. ea ; Na ; Narrow
  4479. ea ; W ; Wide
  4480. </pre>
  4481. </blockquote>
  4482. <p>The resulting regular expression would then be:</p>
  4483. <blockquote>
  4484. <pre>
  4485. /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/
  4486. </pre>
  4487. </blockquote>
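<p>The generation step might look like the following sketch, written in Python rather
than Perl. The function name is hypothetical. Note that an unanchored alternation matches any
string that merely contains one of the aliases, so this sketch relies on a full-string match.</p>

```python
import re


def alias_regex(lines, prop):
    """Build an alternation of all value aliases listed for `prop`.
    Intended to be used with fullmatch(), so that the whole value must match."""
    aliases = []
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] == prop:
            aliases.extend(fields[1:])
    return re.compile("(?:" + "|".join(re.escape(a) for a in aliases) + ")")


EA_LINES = """\
# East_Asian_Width (ea)
ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide
""".splitlines()

ea = alias_regex(EA_LINES, "ea")
print(bool(ea.fullmatch("Na")))       # prints True
print(bool(ea.fullmatch("Narrows")))  # prints False
```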
  4488. <p>For each Unicode binary character property, the regular
  4489. expression can be precomputed simply as:</p>
  4490. <blockquote>
  4491. <pre>
  4492. /N|No|F|False|Y|Yes|T|True/
  4493. </pre>
  4494. </blockquote>
  4495. <p>The Catalog properties, Age, Block, and Script, are another
  4496. type of enumerated character property. All possible values of those properties
  4497. for any given version of the Unicode Standard are listed in PropertyValueAliases.txt,
  4498. so a validating regular expression for a Catalog property for that given version of the UCD can be
generated by ORing together the values, as for the other enumerated properties.</p>
<h4>5.11.2 <a name="Validation_of_CCC" href="#Validation_of_CCC">Canonical_Combining_Class Property</a></h4>
<p>The Canonical_Combining_Class (ccc) property is a hybrid type. The
possible values defined for it in UnicodeData.txt range from 0 to 254 and are numeric
values. However, Canonical_Combining_Class also has symbolic aliases defined for those particular values
that are in actual use; those symbolic aliases are listed in PropertyValueAliases.txt.
To produce a validating regular expression for Canonical_Combining_Class, OR
together the symbolic aliases from PropertyValueAliases.txt, and then add the numeric
range 0..254.</p>
  4508. <p>The value 255 is reserved for use by implementations. When the
  4509. ccc values are represented by bytes, that additional value of 255 may be used
  4510. by an implementation for other purposes.</p>
  4511. <p>The value 133 is reserved. No characters have that value. The property value alias
  4512. CCC133 is retained in accordance with the stability policy regarding property value aliases.</p>
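<p>A validating expression for ccc could be assembled along the following lines. The alias list
here is a partial sample chosen for illustration, and rejecting leading zeros in the numeric
form is an assumption of this sketch.</p>

```python
import re

# Partial, illustrative sample of ccc aliases; a real validator would read
# the complete list from PropertyValueAliases.txt.
SAMPLE_CCC_ALIASES = ["NR", "Not_Reordered", "OV", "Overlay", "B", "Below",
                      "A", "Above", "CCC133"]

# Decimal integers 0..254 without leading zeros (an assumption of this sketch).
NUMERIC_0_254 = r"25[0-4]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]"

CCC_RE = re.compile("(?:" + "|".join(map(re.escape, SAMPLE_CCC_ALIASES))
                    + "|" + NUMERIC_0_254 + ")")

print(bool(CCC_RE.fullmatch("254")))  # prints True
print(bool(CCC_RE.fullmatch("255")))  # prints False: reserved for implementations
```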
  4513. <h4>5.11.3 <a name="Validation_of_Unihan" href="#Validation_of_Unihan">Unihan Properties</a></h4>
  4514. <p>The validating regular expressions for each property tag defined
  4515. in the Unihan database are described in detail in [<a href="../tr41/tr41-21.html#UAX38">UAX38</a>].</p>
  4516. <h4>5.11.4 <a name="Validation_of_Other" href="#Validation_of_Other">Other Properties</a></h4>
  4517. <p>Regular expressions to validate String and Miscellaneous properties
  4518. in the UCD are provided in <i>Table 21</i>. Although Catalog properties may use
  4519. strict tests, as described in <i>Section 5.11.1 <a href="#Validation_of_Enumerated">Enumerated and Binary Properties</a></i>,
  4520. generic patterns for Block
  4521. and Script are also provided in <i>Table 21</i>.</p>
  4522. <p>To simplify the
  4523. presentation of these expressions, commonly occurring subexpressions are first
  4524. abstracted out as variables defined in <i>Table 20</i>.</p>
  4525. <p class="caption">Table 20. <a name="Common_Subexpressions_Table" href="#Common_Subexpressions_Table">Common Subexpressions for Validation</a></p>
  4526. <div align="center">
  4527. <table class="simple">
  4528. <tr>
  4529. <th>Variable</th>
  4530. <th>Value</th>
  4531. <th>Notes and Examples</th>
  4532. </tr>
  4533. <tr>
  4534. <td>$digit</td>
  4535. <td>[0-9]</td>
  4536. <td>"0", "3"</td>
  4537. </tr>
  4538. <tr>
  4539. <td>$hexDigit</td>
  4540. <td>[0-9A-F]</td>
  4541. <td>"1", "A"</td>
  4542. </tr>
  4543. <tr>
  4544. <td>$alphaNum</td>
  4545. <td>[0-9A-Za-z]</td>
  4546. <td>"1", "A", "z"</td>
  4547. </tr>
  4548. <tr>
  4549. <td>$digits</td>
  4550. <td>$digit+</td>
  4551. <td>"0", "12345"</td>
  4552. </tr>
  4553. <tr>
  4554. <td>$label</td>
  4555. <td>$alphaNum+</td>
  4556. <td>"A", "Syriac", "NGKWAEN", "123467", "A005A"</td>
  4557. </tr>
  4558. <tr>
  4559. <td>$positiveDecimal</td>
  4560. <td>$digits\.$digits</td>
  4561. <td>"3.1"</td>
  4562. </tr>
  4563. <tr>
  4564. <td>$decimal</td>
  4565. <td>-?$positiveDecimal</td>
  4566. <td>"3.5", "-0.5"</td>
  4567. </tr>
  4568. <tr>
  4569. <td>$rational</td>
  4570. <td>-?$digits(/$digits)?</td>
  4571. <td>"3/4", "-3/4"</td>
  4572. </tr>
  4573. <tr>
  4574. <td>$optionalDecimal</td>
  4575. <td>-?$digits(\.$digits)?</td>
  4576. <td>"3.5", "-0.5", "2", "1000"</td>
  4577. </tr>
  4578. <tr>
  4579. <td>$name</td>
  4580. <td>$label(( -|- |[-_ ])$label)*</td>
  4581. <td>name, with potential non-medial hyphens</td>
  4582. </tr>
  4583. <tr>
  4584. <td>$name2</td>
  4585. <td>$label([-_ ]$label)*</td>
  4586. <td>name, no non-medial hyphens allowed</td>
  4587. </tr>
  4588. <tr>
  4589. <td>$annotatedName</td>
  4590. <td>$name2( \(.*\))?</td>
  4591. <td>name with optional parenthetical annotation</td>
  4592. </tr>
  4593. <tr>
  4594. <td>$shortName</td>
  4595. <td>[A-Z]{0,3}</td>
  4596. <td>"", "O", "WA", "WAE"</td>
  4597. </tr>
  4598. <tr>
  4599. <td>$codePoint</td>
  4600. <td>(10|$hexDigit)?$hexDigit{4}</td>
  4601. <td>"00A0", "E0100", "10FFFF"</td>
  4602. </tr>
  4603. <tr>
  4604. <td>$codePoints</td>
  4605. <td>$codePoint(\s$codePoint)*</td>
  4606. <td>space-delimited list of 1 to n code points</td>
  4607. </tr>
  4608. <tr>
  4609. <td>$codePoint0</td>
  4610. <td>($codePoints)?</td>
  4611. <td>space-delimited list of 0 to n code points</td>
  4612. </tr>
  4613. </table>
  4614. </div>
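<p>The variables in <i>Table 20</i> can be expanded mechanically into ordinary regular
expressions. The sketch below does so in Python; the helper name is hypothetical, and each
substituted variable is wrapped in a non-capturing group so that a following quantifier
applies to the whole subexpression.</p>

```python
import re

# Table 20 definitions, in dependency order: each uses only earlier variables.
TABLE_20 = [
    ("digit", r"[0-9]"),
    ("hexDigit", r"[0-9A-F]"),
    ("alphaNum", r"[0-9A-Za-z]"),
    ("digits", r"$digit+"),
    ("label", r"$alphaNum+"),
    ("positiveDecimal", r"$digits\.$digits"),
    ("decimal", r"-?$positiveDecimal"),
    ("rational", r"-?$digits(/$digits)?"),
    ("optionalDecimal", r"-?$digits(\.$digits)?"),
    ("name", r"$label(( -|- |[-_ ])$label)*"),
    ("name2", r"$label([-_ ]$label)*"),
    ("annotatedName", r"$name2( \(.*\))?"),
    ("shortName", r"[A-Z]{0,3}"),
    ("codePoint", r"(10|$hexDigit)?$hexDigit{4}"),
    ("codePoints", r"$codePoint(\s$codePoint)*"),
    ("codePoint0", r"($codePoints)?"),
]


def compile_table(definitions):
    """Expand $variables and compile each entry; returns {name: compiled regex}."""
    expanded = {}
    for name, pattern in definitions:
        expanded[name] = re.sub(
            r"\$([A-Za-z0-9]+)",
            lambda m: "(?:" + expanded[m.group(1)] + ")",
            pattern,
        )
    return {name: re.compile(text) for name, text in expanded.items()}


RE = compile_table(TABLE_20)
print(bool(RE["codePoint"].fullmatch("10FFFF")))              # prints True
print(bool(RE["name2"].fullmatch("TIBETAN MARK TSA -PHRU")))  # prints False
```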
  4615. <p>The regular expressions listed in <i>Table 21</i> cover
  4616. all the straightforward cases for other property values. For properties
  4617. involving somewhat more irregular values, such as <a href="#Age">Age</a>,
  4618. <a href="#ISO_Comment">ISO_Comment</a>, and <a href="#Unicode_1_Name">Unicode_1_Name</a>,
  4619. details for validation can be found in [<a href="../tr41/tr41-21.html#UAX42">UAX42</a>].</p>
  4620. <p class="caption">Table 21. <a name="Regular_Expressions_Table" href="#Regular_Expressions_Table">Regular Expressions for Other Property Values</a></p>
  4621. <div align="center">
  4622. <table class="simple">
  4623. <tr>
  4624. <th>Abbr</th>
  4625. <th>Name</th>
  4626. <th colspan="2">Regex for Allowable Values</th>
  4627. </tr>
  4628. <tr>
  4629. <td rowspan="3">nv</td>
  4630. <td rowspan="3">Numeric_Value</td>
  4631. <td>/$decimal/</td>
  4632. <td>Field 2</td>
  4633. </tr>
  4634. <tr>
  4635. <td>/$optionalDecimal/</td>
  4636. <td>Field 3</td>
  4637. </tr>
  4638. <tr>
  4639. <td colspan="2">/$rational/</td>
  4640. </tr>
  4641. <tr>
  4642. <td>blk</td>
  4643. <td>Block</td>
  4644. <td rowSpan="2" colspan="2">/$name2/</td>
  4645. </tr>
  4646. <tr>
  4647. <td>sc</td>
  4648. <td>Script</td>
  4649. </tr>
  4650. <tr>
  4651. <td>dm</td>
  4652. <td>Decomposition_Mapping</td>
  4653. <td rowSpan="2" colspan="2">
/$codePoints/</td>
</tr>
  4655. <tr>
  4656. <td>FC_NFKC</td>
  4657. <td>FC_NFKC_Closure</td>
  4658. </tr>
  4659. <tr>
  4660. <td>NFKC_CF</td>
  4661. <td>NFKC_Casefold</td>
  4662. <td colspan="2">/$codePoint0/</td>
  4663. </tr>
  4664. <tr>
  4665. <td>cf</td>
  4666. <td>Case_Folding</td>
  4667. <td rowSpan="4" colspan="2">
  4668. /$codePoints/</td>
  4669. </tr>
  4670. <tr>
  4671. <td>lc</td>
  4672. <td>Lowercase_Mapping</td>
  4673. </tr>
  4674. <tr>
  4675. <td>tc</td>
  4676. <td>Titlecase_Mapping</td>
  4677. </tr>
  4678. <tr>
  4679. <td>uc</td>
  4680. <td>Uppercase_Mapping</td>
  4681. </tr>
  4682. <tr>
  4683. <td>scf</td>
  4684. <td>Simple_Case_Folding</td>
  4685. <td rowSpan="4" colspan="2">
  4686. /$codePoint/</td>
  4687. </tr>
  4688. <tr>
  4689. <td>slc</td>
  4690. <td>Simple_Lowercase_Mapping</td>
  4691. </tr>
  4692. <tr>
  4693. <td>stc</td>
  4694. <td>Simple_Titlecase_Mapping</td>
  4695. </tr>
  4696. <tr>
  4697. <td>suc</td>
  4698. <td>Simple_Uppercase_Mapping</td>
  4699. </tr>
  4700. <tr>
  4701. <td>bmg</td>
  4702. <td>Bidi_Mirroring_Glyph</td>
  4703. <td colspan="2">/$codePoint/</td>
  4704. </tr>
  4705. <tr>
  4706. <td>na</td>
  4707. <td>Name</td>
  4708. <td rowspan="3" colspan="2">/$name/</td>
  4709. </tr>
  4710. <tr>
  4711. <td>Name_Alias</td>
  4712. <td>Name_Alias</td>
  4713. </tr>
  4714. <tr>
  4715. <td>--</td>
  4716. <td>Names for named sequences*</td>
  4717. </tr>
  4718. <tr>
  4719. <td>na1</td>
  4720. <td>Unicode_1_Name</td>
  4721. <td colspan="2">/$annotatedName/</td>
  4722. </tr>
  4723. <tr>
  4724. <td>JSN</td>
  4725. <td>Jamo_Short_Name</td>
  4726. <td colspan="2">/$shortName/</td>
  4727. </tr>
  4728. </table>
  4729. </div>
  4730. <blockquote>
  4731. <p>* The names for Unicode named character sequences are not formally Unicode
  4732. character property values. However, they follow the same syntax as the Name and Name_Alias
  4733. property values.</p>
  4734. </blockquote>
  4735. <h4>5.11.5 <a name="Validation_of_Multivalued" href="#Validation_of_Multivalued">Validation of Multivalued Properties</a></h4>
<p>Some properties, such as Script_Extensions or kCantonese, have property
  4737. values each consisting of a set of element values. In the data files, these element values
  4738. are separated by spaces. Validation of the property values is performed by first splitting
  4739. each set into element values at the spaces, and then validating each element value
  4740. individually. For example, the elements for Script_Extensions are values of the
  4741. Script property; they are validated according to the validation requirements for the
  4742. Script property. See also Section 5.7.6 <a href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a>.</p>
  4743. <p>The Name_Alias property has values which consist of sets of one or
  4744. more name strings. In the data file for this property, each element value occurs on
  4745. a separate line and can be validated as a separate element.</p>
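<p>The split-then-validate procedure for space-separated values can be sketched as follows.
The function name is hypothetical, and the Script alias pattern is a small illustrative subset
rather than the full list from PropertyValueAliases.txt.</p>

```python
import re


def validate_set_value(value, element_re):
    """Split a space-separated property value and validate each element."""
    elements = value.split(" ")
    return all(element_re.fullmatch(e) for e in elements)


# Illustrative subset of short Script aliases; a real check would use all of them.
SCRIPT_RE = re.compile(r"Beng|Deva|Gran|Latn")
print(validate_set_value("Beng Deva", SCRIPT_RE))   # prints True
print(validate_set_value("Beng Xxxx", SCRIPT_RE))   # prints False
```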
  4746. <h3>5.12 <a name="Deprecation" href="#Deprecation">Deprecation</a></h3>
  4747. <p>In the Unicode Standard, the term <i>deprecation</i> is used somewhat
  4748. differently than it is in some other standards. Deprecation is used to
  4749. mean that a character or other feature is strongly discouraged from use.
  4750. This should not, however, be taken as indicating that anything has been
  4751. removed from the standard, nor that anything is <i>planned</i> for removal
  4752. from the standard. Any such change is constrained by the
  4753. Unicode Consortium Stability Policies [<a href="../tr41/tr41-21.html#Stability">Stability</a>].</p>
  4754. <p>For the Unicode Character Database, there are two important types
  4755. of deprecation to be noted. First, an <i>encoded character</i> may be
  4756. deprecated. Second, a <i>character property</i> may be deprecated.</p>
  4757. <p>When an encoded character is strongly discouraged from use, it is
  4758. given the property value Deprecated=True. The <a href="#Deprecated">Deprecated</a> property
  4759. is a binary property defined specifically to carry this information about
  4760. Unicode characters. Very few characters are ever formally
  4761. deprecated this way; it is not enough that a character be uncommon, obsolete,
  4762. disliked, or not preferred. Only those few characters which have been
  4763. determined by the UTC to have serious architectural defects or which
  4764. have been determined to cause significant implementation problems are
  4765. ever deprecated. Even in the most severe cases, such as the
  4766. deprecated format control characters (U+206A..U+206F), an encoded character
  4767. is <i>never</i> removed from the standard. Furthermore, although deprecated
  4768. characters are strongly discouraged from use, and should be avoided in
  4769. favor of other, more appropriate mechanisms, they <i>may</i> occur in data.
Conformant implementations of Unicode processes such as Unicode normalization <i>must</i>
  4771. handle even deprecated characters correctly.</p>
  4772. <p>In the Unicode Character Database, a character property may
  4773. also become strongly discouraged&#x2014;usually because it no longer
  4774. serves the purpose it was originally defined for. In such cases, the
  4775. property is labelled "deprecated" in
  4776. <i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.
  4777. For example, see the <a href="#Grapheme_Link">Grapheme_Link</a> property.
  4778. Deprecated properties are not recommended for
  4779. exposure in public APIs that support Unicode character properties.</p>
  4780. <h3>5.13 <a name="Property_APIs" href="#Property_APIs">Property APIs</a></h3>
  4781. <p>The Unicode Standard does not specify the exact form of APIs which may be defined
  4782. in software libraries to surface Unicode character properties to applications. However, there
  4783. are some recommendations and general guidelines to follow, which should serve to reduce
  4784. potential confusion and to promote better interoperability between applications using
  4785. the Unicode Character Database.</p>
  4786. <p>In the discussion which follows here, the term <i>API</i> is
  4787. used to refer to a particular function or method, whereas the term <i>API collection</i>
  4788. is used to refer to a related group of APIs, which might constitute a set of functions
  4789. exported from a library, a class definition, or other groupings of related functionality.
  4790. A distinction is also made between a <i>public API</i>, which is exported for general
  4791. application use, and a <i>private API</i>, which may be kept hidden within a library or
  4792. class, intended for internal use.</p>
  4793. <p>First, if an API surfaces values of a particular Unicode character property
  4794. and <i>purports</i> that value to represent a Unicode character property, it should exactly
  4795. follow the specification of that property in the UCD. This principle follows from the
  4796. general approach to conformance for the Unicode Standard: If you say it is Unicode,
  4797. then it should follow the Unicode Standard specification.</p>
  4798. <p>Second, an API should be clear about which version of the UCD it
  4799. supports. This can be done, for example, with documentation, either external or
  4800. included in the source in header files, class definition notes, and so forth.
  4801. For an API collection, an even better option is to include an API which explicitly
  4802. reports which version of the UCD is supported.
  4803. This provision should reduce confusion regarding particular property
  4804. values which might change between versions of the Unicode Standard, as well as making
  4805. it clear which repertoire of encoded characters is intended to be covered. There is
  4806. no principled constraint on an API supporting <i>more than one</i> version of the UCD, as long
  4807. as it is clear about how it does so.</p>
  4808. <p>Third, although there is no constraint on an API declaring that it
  4809. only supports a designated subset of Unicode characters, best practice for a general
  4810. purpose character property API would be to support the entire range of Unicode
code points, providing determinate and well-documented property values for any valid Unicode
  4812. code point input. That would include providing correct default property values for
  4813. any unassigned code point. See <i>Section 2.2, <a href="#Use_Default">Use of Default Values</a></i>
  4814. for an explanation of that concept.</p>
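<p>As one possible shape for such an API collection, the sketch below wraps Python's standard
unicodedata module. The two function names are hypothetical; the sketch reports the supported
UCD version explicitly and returns a General_Category value for every code point, including
unassigned ones.</p>

```python
import unicodedata


def ucd_version():
    """Report the UCD version backing this API collection, as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def general_category(cp):
    """General_Category for any code point 0..0x10FFFF, including unassigned ones."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return unicodedata.category(chr(cp))


print(general_category(0x0041))  # prints Lu
print(general_category(0xFFFF))  # prints Cn (U+FFFF is a noncharacter)
```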
  4815. <p>Fourth, a Unicode character property API is not precluded from
  4816. extending or tailoring its support of character properties, as long as such
  4817. behavior is clearly documented, so that applications understand the values they
  4818. will be getting by calling the API. For example, an API might surface an
  4819. extended new property such as IsDanda, which is not formally part of the
  4820. properties specified by the UCD, but which can be inferred from the
  4821. documentation of the Unicode Standard. An API supporting a particular
  4822. tailoring of the Unicode Line Breaking Algorithm could surface tailored
  4823. Line_Break property values to support that behavior. Alternatively, an API supporting
  4824. a particular private use agreement could surface privately-defined properties
  4825. for a designated range of PUA characters. All such use of APIs should be
  4826. considered conformant ways of extending API collections using the UCD.</p>
  4827. <p>Designers of API collections to support Unicode character properties must
  4828. also be aware that not all Unicode character properties are equal. There is no
  4829. requirement, express or implied, that <i>all</i> Unicode character properties
  4830. should be supported in a given API collection. In fact, an approach that simply parses
  4831. the UCD and surfaces <i>all</i> Unicode character properties verbatim is
  4832. very likely to result in a bad design. Character properties need to be
  4833. understood in the context of the various Unicode algorithms they are designed
  4834. to support.</p>
  4835. <p>The following subtypes of
  4836. Unicode character properties should generally <i>not</i> be exposed in APIs,
  4837. except in limited circumstances. They may not be useful, particularly
  4838. in public API collections, and may instead prove misleading to the users
  4839. of such API collections.</p>
  4840. <ul>
  4841. <li><i><a href="#Contributory_Properties">Contributory properties</a></i> are not recommended for public APIs.</li>
  4842. <li>A subset of Unicode normalization-related properties are not recommended for public APIs. See
  4843. <i>Section 5.7.5, <a href="#Decompositions_and_Normalization">Decompositions and Normalization</a></i>.</li>
  4844. <li>Deprecated properties are not recommended for public APIs. See
  4845. <i>Section 5.12, <a href="#Deprecation">Deprecation</a></i>.</li>
  4846. </ul>
  4847. <h3>5.14 <a name="Character_Age" href="#Character_Age">Character Age</a></h3>
  4848. <p>The <a href="#Age">Age</a> property indicates the first version in which a
  4849. particular Unicode character was assigned. For example, U+20AC &#x20AC; EURO SIGN was
  4850. added to Version 2.1 of the Unicode Standard, so it has age=2.1, while
  4851. U+20B9 &#x20B9; INDIAN RUPEE SIGN was added to Version 6.0 of the Unicode Standard,
  4852. so it has age=6.0.</p>
  4853. <p>The short values for the Age property for assigned (designated) code points are of the form &quot;m.n&quot;,
  4854. with the first field corresponding to the major version, and the second field corresponding
  4855. to the minor version. There is no need for a third version field, because new
  4856. characters are never assigned in update versions of
  4857. the standard. The long
  4858. values for the Age property for assigned code points start with a &quot;V&quot; and use an underscore instead
  4859. of a dot between the major and minor version numbers: V2_1, V6_0, and so on. This
  4860. makes the long format more useful as an identifier in programming languages. It is
  4861. also useful in regular expressions, where the dot has other significance.</p>
  4862. <p>The default value of the Age property, used for unassigned (undesignated) code points,
  4863. is expressed with labels that depart from the numerical versioning scheme
  4864. of the Age property for assigned code points; the short form for the default is &quot;NA&quot;,
  4865. and the long form for the default is &quot;Unassigned&quot;. Implementations of parsers
  4866. which manipulate the Age property need to be prepared for this special case,
  4867. rather than expecting the default value to be expressed numerically, as &quot;0.0&quot;, for example.</p>
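<p>A conversion between the short and long forms, including the special case for the default
value, might be written as in this sketch. The function names are hypothetical.</p>

```python
def age_short_to_long(short):
    """Convert a short Age value such as "2.1" to its long form "V2_1"."""
    if short == "NA":                      # default value for unassigned code points
        return "Unassigned"
    major, minor = short.split(".")
    return f"V{int(major)}_{int(minor)}"


def age_long_to_short(long_form):
    """Convert a long Age value such as "V6_0" to its short form "6.0"."""
    if long_form == "Unassigned":
        return "NA"
    major, minor = long_form[1:].split("_")
    return f"{int(major)}.{int(minor)}"


print(age_short_to_long("2.1"))   # prints V2_1
print(age_long_to_short("V6_0"))  # prints 6.0
```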
  4868. <p>The Age property is
  4869. based on when a character is encoded in the standard. It is normative and immutable, and
  4870. cannot be meaningfully tailored.</p>
  4871. <p>The minimum value of the Age property is &quot;1.1&quot;,
  4872. instead of &quot;1.0&quot;, because of the substantial and
  4873. incompatible changes to the standard resulting from the merger of code points and
  4874. character names between the Unicode Standard and ISO/IEC 10646 for their 1993
  4875. publications. For Hangul syllable characters, which were
  4876. extensively augmented in Unicode 2.0, the Age value is set to &quot;2.0&quot;, even
  4877. though a subset of the Hangul syllables had been published in earlier versions,
  4878. at different code points.</p>
  4879. <p>Private use characters, noncharacter code points, and surrogate code
  4880. points also get Age values. The private use characters and noncharacter code
  4881. points on the BMP have age=1.1. However, the full architecture for UTF-16 and multiple planes
  4882. was not fully documented until Unicode 2.0, so the private use characters and
  4883. noncharacter code points on supplementary planes, as well as the surrogate
  4884. code points in the range D800..DFFF, are given the value age=2.0.</p>
  4885. <p>The Age property cannot be derived from the other
  4886. data files in any single version of the Unicode Character Database. Its derivation
  4887. is done, rather, by tools that compare the assigned characters <i>between</i>
successive versions. The data file <a href="#DerivedAge.txt">DerivedAge.txt</a>
  4889. provides the definitive listing of the
  4890. Age property value for all code points, as of that version of the standard.</p>
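<p>The comparison across versions can be pictured with the following sketch. The function name
is hypothetical, and the repertoires are toy data containing only the two characters used as
examples in this section; this is not the actual tooling used to produce DerivedAge.txt.</p>

```python
def derive_age(assigned_by_version):
    """assigned_by_version: list of (version, set_of_code_points), oldest first.
    Returns {code_point: first version in which it appears}."""
    age = {}
    for version, assigned in assigned_by_version:
        for cp in assigned:
            age.setdefault(cp, version)   # keep the earliest version seen
    return age


# Toy repertoires (illustrative data only).
history = [("2.1", {0x20AC}), ("6.0", {0x20AC, 0x20B9})]
age = derive_age(history)
print(age[0x20AC], age[0x20B9])  # prints: 2.1 6.0
```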
  4891. <p>The typical use case for the Age property in regular expressions
  4892. is to search for all characters that were
  4893. present in a given version. For this reason,
  4894. an expression such as &quot;\p{age=V3_0}&quot; is exceptionally
  4895. defined to match all of the code
  4896. points assigned in Version 3.0&#x2014;that is, all the code points with
  4897. a value <i>less than or equal to</i> the value 3.0 for the Age property, rather than
  4898. just the subset of those code points with the value 3.0. This interprets
  4899. &quot;\p{age=V3_0}&quot;
  4900. as the set of all characters assigned as of Unicode 3.0, rather than
  4901. as just the set of characters <i>added</i> to Unicode 3.0 subsequent to the
  4902. prior version. For more
  4903. information, see Unicode Technical Standard #18,
  4904. "Unicode Regular Expressions" [<a href="../tr41/tr41-21.html#UTS18">UTS18</a>].</p>
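<p>The "less than or equal to" interpretation can be sketched as a comparison of parsed version
numbers. The function names are hypothetical. Parsing to integer tuples, rather than comparing
strings, keeps versions such as 10.0 ordered after 9.0.</p>

```python
def parse_age(short):
    """Parse a short Age value such as "3.0" into a tuple of ints."""
    return tuple(int(part) for part in short.split("."))


def matches_age(char_age, query_age):
    """True if a character whose Age is `char_age` is matched by an
    age query for `query_age`, using the cumulative interpretation."""
    if char_age == "NA":                  # unassigned code points never match
        return False
    return parse_age(char_age) <= parse_age(query_age)


print(matches_age("2.1", "3.0"))  # prints True
print(matches_age("6.0", "3.0"))  # prints False
```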
  4905. <h2>6 <a name="Test_Files" href="#Test_Files">Test Files</a></h2>
  4906. <p>The UCD contains a number of test data files.
  4907. Those provide data in standard formats which can be used to test
  4908. implementations of Unicode algorithms. The test data files
  4909. distributed with this version of the UCD are listed in
  4910. <i>Table 22</i>.</p>
  4911. <p class="caption">Table 22. <a name="Algorithm_Test_Table" href="#Algorithm_Test_Table">Unicode Algorithm Test Data Files</a></p>
  4912. <div align="center">
  4913. <table class="simple">
  4914. <tr>
  4915. <th>File Name</th>
  4916. <th>Specification</th>
  4917. <th>Status</th>
  4918. <th>Unicode Algorithm</th>
  4919. </tr>
  4920. <tr>
  4921. <td>BidiTest.txt</td>
  4922. <td>[<a href="../tr41/tr41-21.html#UAX9">UAX9</a>]</td>
  4923. <td style="text-align:center">N</td>
  4924. <td>Unicode Bidirectional Algorithm</td>
  4925. </tr>
  4926. <tr>
  4927. <td>BidiCharacterTest.txt</td>
  4928. <td>[<a href="../tr41/tr41-21.html#UAX9">UAX9</a>]</td>
  4929. <td style="text-align:center">N</td>
  4930. <td>Unicode Bidirectional Algorithm</td>
  4931. </tr>
  4932. <tr>
  4933. <td>NormalizationTest.txt</td>
  4934. <td>[<a href="../tr41/tr41-21.html#UAX15">UAX15</a>]</td>
  4935. <td style="text-align:center">N</td>
  4936. <td>Unicode Normalization Algorithm</td>
  4937. </tr>
  4938. <tr>
  4939. <td>LineBreakTest.txt</td>
  4940. <td>[<a href="../tr41/tr41-21.html#UAX14">UAX14</a>]</td>
  4941. <td style="text-align:center">N</td>
  4942. <td>Unicode Line Breaking Algorithm</td>
  4943. </tr>
  4944. <tr>
  4945. <td>GraphemeBreakTest.txt</td>
  4946. <td>[<a href="../tr41/tr41-21.html#UAX29">UAX29</a>]</td>
  4947. <td style="text-align:center">N</td>
  4948. <td>Grapheme Cluster Boundary Determination</td>
  4949. </tr>
  4950. <tr>
  4951. <td>WordBreakTest.txt</td>
  4952. <td>[<a href="../tr41/tr41-21.html#UAX29">UAX29</a>]</td>
  4953. <td style="text-align:center">N</td>
  4954. <td>Word Boundary Determination</td>
  4955. </tr>
  4956. <tr>
  4957. <td>SentenceBreakTest.txt</td>
  4958. <td>[<a href="../tr41/tr41-21.html#UAX29">UAX29</a>]</td>
  4959. <td style="text-align:center">N</td>
  4960. <td>Sentence Boundary Determination</td>
  4961. </tr>
  4962. </table>
  4963. </div>
  4964. <p>The normative status of these test files reflects their use to
  4965. determine the correctness of implementations claiming conformance
  4966. to the respective algorithms listed in the table. There is no
  4967. requirement that any particular Unicode implementation also
  4968. implement the Unicode Line Breaking Algorithm, for example, but
  4969. <i>if</i> it implements that algorithm correctly, it should be
  4970. able to replicate the test case results specified in the
  4971. data entries in LineBreakTest.txt.</p>
  4972. <h3>6.1 <a name="NormalizationTest_txt" href="#NormalizationTest_txt"> NormalizationTest.txt </a></h3>
  4973. <p>This file contains data which can be used to test an implementation of the
  4974. Unicode Normalization Algorithm.
  4975. (See [<a href="../tr41/tr41-21.html#UAX15">UAX15</a>] and [<a href="../tr41/tr41-21.html#Tests15">Tests15</a>].)</p>
  4976. <p>The data file has a Unicode string in the first field (which may consist
  4977. of just a single code point). The next four fields then specify the expected
  4978. output results of converting that string to Unicode Normalization Forms
  4979. NFC, NFD, NFKC, and NFKD, respectively. There are many tricky edge cases
  4980. included in the input data, to ensure that implementations have correctly
  4981. implemented some of the more complex subtleties of the Unicode Normalization
  4982. Algorithm.</p>
  4983. <p>The header section of NormalizationTest.txt provides additional information
  4984. regarding the normalization invariant relations that any conformant
  4985. implementation should be able to replicate.</p>
  4986. <p>The Unicode Normalization Algorithm is not tailorable. Conformant
  4987. implementations should be expected to produce results as specified in
  4988. NormalizationTest.txt and should not deviate from those results.</p>
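<p>The field layout described above can be checked mechanically. The following is a minimal sketch, not a reference implementation: it assumes that fields are separated by semicolons, that code points are written as space-separated hexadecimal values, that <code>#</code> starts a comment, and that part headings start with <code>@</code>. The helper names <code>parse_test_line</code> and <code>check_line</code> are hypothetical. Python's <code>unicodedata</code> module may be built against a different Unicode version than the test file, so mismatches on recently encoded characters do not necessarily indicate a bug.</p>

```python
import unicodedata

# Order of the four expected-result fields that follow the source string.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def parse_test_line(line):
    """Return the five fields of a data line as strings, or None for other lines."""
    data = line.split("#", 1)[0].strip()      # drop the trailing comment
    if not data or data.startswith("@"):      # blank line or part heading (assumed)
        return None
    fields = [f.strip() for f in data.split(";")]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields[:5]]


def check_line(line):
    """True if normalizing field 1 reproduces fields 2..5 in order."""
    cols = parse_test_line(line)
    if cols is None:
        return True
    source, expected = cols[0], cols[1:]
    return all(
        unicodedata.normalize(form, source) == want
        for form, want in zip(FORMS, expected)
    )


# Sample entry in the file's format for U+1E0A.
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
print(check_line(sample))
```

<p>This sketch only compares each expected field against the normalization of the first field. The header of NormalizationTest.txt lists further invariant relations among the fields, which a fuller test driver would also verify.</p>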
  4989. <h3>6.2 <a name="Segmentation_Test_Files" href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a></h3>
  4990. <p>LineBreakTest.txt, located in the auxiliary directory of the UCD,
  4991. contains data which can be used
  4992. to test an implementation of the Unicode Line Breaking Algorithm.
  4993. (See [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>] and [<a href="../tr41/tr41-21.html#Tests14">Tests14</a>].) The header of
  4994. that file specifies the data format and the use of the test data to
  4995. specify line break opportunities. Note that non-ASCII characters are used
  4996. in this test data as field delimiters.</p>
  4997. <p>There is an associated documentation file, LineBreakTest.html, which displays
  4998. the results of the Line Breaking Algorithm in an interactive chart form, with a
  4999. documented listing of the rules.</p>
  5000. <p>The Unicode text segmentation test data files are also located in the
  5001. auxiliary directory of the UCD. (See [<a href="../tr41/tr41-21.html#Tests29">Tests29</a>].) They
  5002. contain data which can be used to test an implementation of the segmentation
  5003. algorithms specified in [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].
  5004. The headers of
  5005. those files specify the data format and the use of the test data to
  5006. specify text segmentation opportunities. Note that non-ASCII characters are used
  5007. in this test data as field delimiters.</p>
  5008. <p>There are also associated documentation
  5009. files, which display the results of the segmentation algorithms in an
  5010. interactive chart form, with a documented listing of the rules:</p>
  5011. <ul>
  5012. <li>GraphemeBreakTest.html </li>
  5013. <li>SentenceBreakTest.html </li>
  5014. <li>WordBreakTest.html </li>
  5015. </ul>
  5016. <p>Unlike the Unicode Normalization Algorithm, the Unicode Line Breaking
  5017. Algorithm and the various text segmentation algorithms are tailorable,
  5018. and there is every expectation that implementations will tailor these
  5019. algorithms to produce results as needed. The test data files only test
  5020. the <i>default</i> behavior of the algorithms. Testing of tailored implementations
  5021. will need to modify and/or extend the test cases as appropriate to match
  5022. any documented tailoring.</p>
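<p>As an illustration of the non-ASCII field delimiters mentioned above, the sketch below parses one entry of a break test file. It assumes that U+00F7 DIVISION SIGN marks a break opportunity and U+00D7 MULTIPLICATION SIGN marks a position with no break, with hexadecimal code points between them and a <code>#</code> comment at the end; the header of each test file remains the authoritative description. The function name <code>parse_break_test</code> is hypothetical.</p>

```python
BREAK = "\u00F7"     # DIVISION SIGN: break opportunity (assumed delimiter)
NO_BREAK = "\u00D7"  # MULTIPLICATION SIGN: no break here (assumed delimiter)


def parse_break_test(line):
    """Return (text, break_offsets) for a data line, or None if it has no data.

    Offsets count code points, so 0 is the start of the text and
    len(text) is the end of the text.
    """
    tokens = line.split("#", 1)[0].split()
    if not tokens:
        return None
    chars = []
    breaks = []
    for token in tokens:
        if token == BREAK:
            breaks.append(len(chars))
        elif token == NO_BREAK:
            continue
        else:
            chars.append(chr(int(token, 16)))
    return "".join(chars), breaks


# Hand-written sample in the assumed format, not copied from a data file.
sample = f"{BREAK} 0061 {NO_BREAK} 0308 {BREAK} 0062 {BREAK}\t# illustrative comment"
text, breaks = parse_break_test(sample)
print([f"{ord(c):04X}" for c in text], breaks)
```

<p>An implementation under test would compute its own boundary offsets for <code>text</code> and compare them with <code>breaks</code>. A tailored implementation would first adjust the expected offsets to match its documented tailoring.</p>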
  5023. <h3>6.3 <a name="BidiTest_txt" href="#BidiTest_txt">Bidirectional Test Files</a></h3>
  5024. <p>These files contain data
  5025. which can be used to test an implementation of the
  5026. Unicode Bidirectional Algorithm.
  5027. (See [<a href="../tr41/tr41-21.html#UAX9">UAX9</a>] and [<a href="../tr41/tr41-21.html#Tests9">Tests9</a>].)</p>
  5028. <p>The data in BidiTest.txt is intended to exhaustively test
  5029. all possible combinations of Bidi_Class values for strings of length four or less.
  5030. To allow for the resulting very large number of test cases,
  5031. the data file has a somewhat complicated format which is
  5032. described in the header. Fundamentally, for each input string and for each
  5033. possible input paragraph level, the test data specifies the resulting bidi levels and
  5034. expected reordering.</p>
  5035. <p>The data in BidiCharacterTest.txt is provided to test various
  5036. edge cases for the algorithm. It contains an extra field which allows for explicit
  5037. control of the overall directional context for each test case.</p>
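<p>The sketch below reads one entry of BidiCharacterTest.txt into a small record. The field layout is an assumption based on the file header as commonly documented: code points; paragraph direction (0 for left-to-right, 1 for right-to-left, 2 for auto-LTR); resolved paragraph level; resolved levels, with <code>x</code> for characters removed by the algorithm; and visual order indices. The names <code>BidiCharacterCase</code> and <code>parse_bidi_character_line</code> are hypothetical, and the header of the data file should be consulted for the definitive format.</p>

```python
from typing import List, NamedTuple, Optional


class BidiCharacterCase(NamedTuple):
    text: str
    paragraph_direction: int       # assumed: 0 = LTR, 1 = RTL, 2 = auto-LTR
    paragraph_level: int
    levels: List[Optional[int]]    # None where the file has 'x'
    visual_order: List[int]


def parse_bidi_character_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [part.strip() for part in data.split(";")]
    if len(fields) < 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return BidiCharacterCase(
        text="".join(chr(int(cp, 16)) for cp in fields[0].split()),
        paragraph_direction=int(fields[1]),
        paragraph_level=int(fields[2]),
        levels=[None if v == "x" else int(v) for v in fields[3].split()],
        visual_order=[int(v) for v in fields[4].split()],
    )


# Hand-written sample in the assumed format, not copied from the data file.
case = parse_bidi_character_line("0061 202A 0062;0;0;0 x 2;0 2")
print(case.levels, case.visual_order)
```

<p>A test driver would run its bidirectional implementation on <code>case.text</code> with the given paragraph direction, then compare the resolved levels and visual ordering with the parsed values.</p>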
  5038. <p>The Unicode Bidirectional Algorithm is tailorable within certain limits.
  5039. Conformant implementations with no tailoring are expected to produce the results as
  5040. specified in BidiTest.txt and BidiCharacterTest.txt, and should not deviate from those results. Tailored
  5041. implementations can also use the data in
  5042. the test files to test for overall conformance
  5043. to the algorithm by changing the assignment of properties to characters to reflect
  5044. the details of their tailoring.</p>
  5045. <h2>7 <a name="Change_History" href="#Change_History">UCD Change History</a></h2>
  5046. <p>This section summarizes the recent
  5047. changes to the UCD&#x2014;including its documentation files&#x2014;and
  5048. is organized by Unicode versions.</p>
  5049. <p>References in the change history
  5050. are often made to a Public Review Issue (PRI). See
  5051. <a href="http://www.unicode.org/review/resolved.html">
  5052. http://www.unicode.org/review/resolved.html</a> for more information about
  5053. each of those cases.</p>
  5054. <hr>
  5055. <h3><a name="Unicode_10.0.0" href="http://www.unicode.org/versions/components-10.0.0.html">Unicode 10.0.0</a></h3>
  5056. <p><b>Changes in specific files:</b></p>
  5057. <p>New data files were added to the UCD: two primary files, NushuSources.txt and VerticalOrientation.txt,
  5058. documented in this section, as well as an extracted file, DerivedName.txt, generated from UnicodeData.txt.</p>
  5059. <p>The documentation file StandardizedVariants.html, already obsoleted as of Version 9.0,
  5060. was removed altogether from the UCD. Its function was superseded by the Unicode code charts and the emoji charts.</p>
  5061. <p>Appropriate existing data files were updated to add the 8,518 new characters encoded in Unicode 10.0,
  5062. which consist of 7,494 CJK unified ideographs and 1,024 other characters.
  5063. Major changes that are most likely to affect implementations are documented
  5064. in <a href="http://www.unicode.org/versions/Unicode10.0.0/#Migration">Section M of the Unicode 10.0.0 page</a>.
  5065. Detailed data file updates resulting from encoding the new characters and from various character
  5066. property changes are summarized below, in the same grouping manner used in
  5067. <a href="http://www.unicode.org/versions/components-10.0.0.html">Components of Unicode 10.0.0</a>.</p>
  5068. <p>Note that minor editorial updates and changes to the derived and extracted data files are not documented here.</p>
  5069. <h4>Core Data</h4>
  5070. <ul>
  5071. <li>ArabicShaping.txt
  5072. <ul>
  5073. <li>Entries were added for the 11 letters in the new Syriac Supplement block.
  5074. The letters, used for writing a dialect of the Malayalam language in the Syriac script (a form of Garshuni),
  5075. each have their own joining group, with schematic names that include the word MALAYALAM.</li>
  5076. </ul>
  5077. </li>
  5078. <li>Blocks.txt
  5079. <ul>
  5080. <li>Seven new blocks were added, including blocks for the four new scripts, Masaram Gondi, Nushu, Soyombo, and Zanabazar Square.</li>
  5081. <li>A Syriac Supplement block was added to the right-to-left area of the Basic Multilingual Plane.</li>
  5082. <li>A large collection of rare and historic CJK unified ideographs was added in a new block, CJK Unified Ideographs Extension F.
  5083. A set of 21 CJK unified ideographs was also added to the main CJK Unified Ideographs block.</li>
  5084. <li>A set of 285 Hentaigana characters was added to the existing Kana Supplement block and a new, adjacent block, Kana Extended-A.</li>
  5085. </ul>
  5086. </li>
  5087. <li>EastAsianWidth.txt
  5088. <ul>
  5089. <li>The following sets of newly encoded characters were assigned the East_Asian_Width property value Wide:
  5090. the new CJK unified ideographs; the Hentaigana characters; all Nushu characters, including the iteration mark, U+16FE1;
  5091. and one new Bopomofo letter, U+312E.</li>
  5092. <li>The 56 newly encoded pictographic symbols that have the Emoji_Presentation property as of Version 5.0 of
  5093. Unicode Technical Standard #51, "Unicode Emoji", were also assigned the East_Asian_Width property value Wide
  5094. [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>].</li>
  5095. <li>All of the other new characters, including new symbols, were assigned the East_Asian_Width property value Neutral.</li>
  5096. </ul>
  5097. </li>
  5098. <li>EmojiSources.txt
  5099. <ul>
  5100. <li>There were no data additions or changes, but a comment was added to document that 11 mappings for keycap sequences
  5101. are historical and differ from the named character sequences with keycaps listed in NamedSequences.txt and the
  5102. corresponding UTS #51 emoji keycap sequences.</li>
  5103. </ul>
  5104. </li>
  5105. <li>IndicPositionalCategory.txt
  5106. <ul>
  5107. <li>Entries were added for the matras and non-vocalic marks of the three Brahmi-derived scripts introduced in
  5108. Unicode 10.0&#x2014;Masaram Gondi, Soyombo, and Zanabazar Square.</li>
  5109. <li>Entries were also added for new marks of existing Indic scripts, namely Gujarati and Malayalam, as well as a new Vedic mark.</li>
  5110. <li>The Indic_Positional_Category property value of U+A9BF JAVANESE CONSONANT SIGN CAKRA was corrected from Right to Bottom_And_Left.
  5111. The latter is a new Indic_Positional_Category property value, for which a new section was added to the file.</li>
  5112. </ul>
  5113. </li>
  5114. <li>IndicSyllabicCategory.txt
  5115. <ul>
  5116. <li>Characters in the three newly encoded Brahmi-derived scripts, as well as new characters of existing Indic scripts,
  5117. were added with appropriate property values.</li>
  5118. <li>The three newly encoded Gujarati nukta characters, U+0AFD..U+0AFF, were assigned the Indic_Syllabic_Category property
  5119. value Nukta, although their Canonical_Combining_Class property value was set to 0 (Not_Reordered) rather than 7 (Nukta).
  5120. Due to the increased number of exceptions, the derivation expression for the Indic_Syllabic_Category value Nukta
  5121. was removed from the comment lines in the file.</li>
  5122. <li>The classification of several previously encoded Tai Tham characters was revised based on expert feedback.</li>
  5123. <li>A few previously encoded Devanagari and Grantha nasalization signs were assigned the Indic_Syllabic_Category property value Bindu.</li>
  5124. <li>The documentation of a few syllabic categories was also expanded.</li>
  5125. </ul>
  5126. </li>
  5127. <li>LineBreak.txt
  5128. <ul>
  5129. <li>Newly encoded characters were assigned appropriate Line_Break property values.</li>
  5130. <li>The newly encoded U+20BF BITCOIN SIGN was assigned the Line_Break property value Prefix_Numeric,
  5131. the default for currency symbols.</li>
  5132. <li>Of the 56 newly encoded emoji symbols, the 16 which appear as bases in valid emoji modifier sequences or, equivalently,
  5133. have the Emoji_Modifier_Base property as of Version 5.0 of Unicode Technical Standard #51, "Unicode Emoji",
  5134. were assigned the Line_Break property value E_Base [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>].
  5135. That value represents a change from the default value Line_Break=Ideographic for all unassigned code points in
  5136. the range U+1F000..U+1FFFD.
  5137. The other 40 new emoji were assigned the Line_Break property value Ideographic.</li>
  5138. <li>The new Typicon symbols in the range U+1F900..U+1F90B were assigned the Line_Break property value Alphabetic.</li>
  5139. <li>Five previously encoded emoji symbols (U+1F3C2, U+1F3C7, U+1F3CC, U+1F574, and U+1F6CC) changed their
  5140. Line_Break property value from Ideographic to E_Base, because they are included in the set of bases
  5141. for valid emoji modifier sequences as of UTS #51 Version 5.0. (They were added to that set in UTR #51 Version 4.0.)</li>
  5142. <li>Conversely, two previously encoded emoji symbols (U+1F91D and U+1F93C) changed their Line_Break property value
  5143. from E_Base to Ideographic, because they no longer appear in valid emoji modifier sequences as of
  5144. UTS #51 Version 5.0. (They were removed from that set in UTR #51 Version 4.0.)</li>
  5145. <li>No other existing characters changed their Line_Break property values.</li>
  5146. </ul>
  5147. </li>
  5148. <li>NameAliases.txt
  5149. <ul>
  5150. <li>Four formal aliases of type "correction" were added for U+11EC..U+11EF, noting that "YESIEUNG" is the correct identification of the character component termed "IEUNG" in the character names.</li>
  5151. <li>One formal alias of type "correction" was added for U+1B001, to identify it as part of the complete hentaigana set.</li>
  5152. </ul>
  5153. </li>
  5154. <li>NamedSequences.txt
  5155. <ul>
  5156. <li>The set of 12 named character sequences used for emoji keycap sequences was moved from NamedSequencesProv.txt
  5157. to this file, as the named sequences advanced from provisional to approved status.</li>
  5158. </ul>
  5159. </li>
  5160. <li>NamedSequencesProv.txt
  5161. <ul>
  5162. <li>The set of 12 named character sequences used for emoji keycap sequences was moved from this file
  5163. to NamedSequences.txt, as the named sequences advanced from provisional to approved status.</li>
  5164. </ul>
  5165. </li>
  5166. <li>NamesList.txt
  5167. <ul>
  5168. <li>Content was updated throughout with new characters, as well as annotations, cross references, subheadings, and remarks.</li>
  5169. </ul>
  5170. </li>
  5171. <li>NushuSources.txt
  5172. <ul>
  5173. <li>This new data file was added to the UCD. It contains source mappings and readings for Nushu ideographs,
  5174. as well as radical-stroke data for the ideographs, in the same format as the Unihan data files and TangutSources.txt.</li>
  5175. </ul>
  5176. </li>
  5177. <li>PropertyAliases.txt
  5178. <ul>
  5179. <li>An entry was added for the enumerated property Vertical_Orientation, abbreviated vo, which was incorporated in the UCD.</li>
  5180. <li>An entry was added for the newly defined binary property, Regional_Indicator, abbreviated RI.</li>
  5181. </ul>
  5182. </li>
  5183. <li>PropertyValueAliases.txt
  5184. <ul>
  5185. <li>The 10.0 value, with the alias V10_0, was added to the catalog property Age.</li>
  5186. <li>Script and Block property values were listed for the four new scripts and seven new blocks introduced.</li>
  5187. <li>Entries were added for the 11 new Joining_Group property values introduced with the Malayalam Garshuni letters
  5188. in the new Syriac Supplement block.</li>
  5189. <li>New sections were added for the values of the newly defined binary property Regional_Indicator and
  5190. the enumerated property Vertical_Orientation.</li>
  5191. <li>An entry was added for a new Indic_Positional_Category property value, Bottom_And_Left.</li>
  5192. </ul>
  5193. </li>
  5194. <li>PropList.txt
  5195. <ul>
  5196. <li>Most of the newly encoded combining marks were assigned either the contributory property Other_Alphabetic
  5197. or the binary property Diacritic, as appropriate.</li>
  5198. <li>Newly encoded punctuation characters that mark the end of various sections of text, such as dandas,
  5199. were assigned the appropriate binary properties Terminal_Punctuation or Sentence_Terminal.</li>
  5200. <li>All 7,494 new CJK unified ideographs were assigned both the Ideographic and the Unified_Ideograph binary properties.</li>
  5201. <li>The newly encoded Nushu ideographs (which do not include the iteration mark U+16FE1) were assigned
  5202. the Ideographic property, but not the Unified_Ideograph property.</li>
  5203. <li>The newly encoded characters U+16FE1 NUSHU ITERATION MARK and U+11A98 SOYOMBO GEMINATION MARK
  5204. were assigned the binary property Extender.</li>
  5205. <li>A section was added for the 26 regional indicator characters, U+1F1E6..U+1F1FF, which were assigned
  5206. the newly defined binary property Regional_Indicator.</li>
  5207. </ul>
  5208. </li>
  5209. <li>Scripts.txt
  5210. <ul>
  5211. <li>The new characters were assigned appropriate Script property values, including four new values
  5212. for the newly encoded scripts: Masaram_Gondi, Nushu, Soyombo, and Zanabazar_Square.</li>
  5213. <li>The newly encoded emoji were assigned the Script property value Common, in a manner consistent with
  5214. similar characters encoded previously.</li>
  5215. <li>The ideographs in the new block CJK Unified Ideographs Extension F, as well as the 21 added to the main
  5216. CJK Unified Ideographs block, were assigned the Script property value Han.</li>
  5217. <li>The newly encoded Japanese hentaigana characters were assigned the Script property value Hiragana,
  5218. as hentaigana are effectively historic variants of Hiragana syllables.</li>
  5219. <li>The Malayalam Garshuni letters in the new Syriac Supplement block were assigned the Script property value Syriac.</li>
  5220. <li>Other script specific characters were assigned respective Script property values:
  5221. Bengali, Bopomofo, Gujarati, Malayalam, and Old_Italic.</li>
  5222. <li>The Script property value of U+061C ARABIC LETTER MARK (ALM) was changed from Common to Arabic,
  5223. the initial value that the character had taken when it was encoded in Unicode 6.3.
  5224. The change was made so the character can have the same effects on digit substitution as regular Arabic letters.</li>
  5225. <li>The change for ALM was the only change in Script property values for existing characters.</li>
  5226. </ul>
  5227. </li>
  5228. <li>ScriptExtensions.txt
  5229. <ul>
  5230. <li>The newly encoded U+1CF7 VEDIC SIGN ATIKRAMA was assigned the Script_Extensions property value {Bengali},
  5231. as the character is attested in Bengali publications while not being script specific, which is typical for Vedic marks.</li>
  5232. <li>The existing character U+11301 GRANTHA SIGN CANDRABINDU was assigned the Script_Extensions property value
  5233. {Grantha Tamil} based on attested use with Tamil for writing Sanskrit.</li>
  5234. <li>The Script_Extensions property value of U+061C ARABIC LETTER MARK (ALM) was changed to {Arabic Syriac Thaana},
  5235. the initial value that the character had taken when it was encoded in Unicode 6.3.
  5236. The Script_Extensions change was made in conjunction with the Script change to Arabic for U+061C.</li>
  5237. </ul>
  5238. </li>
  5239. <li>StandardizedVariants.txt
  5240. <ul>
  5241. <li>All of the variation sequences involving emoji, now known more specifically as emoji presentation sequences
  5242. and text presentation sequences, were moved from StandardizedVariants.txt to the UTS #51 data file
  5243. emoji-variation-sequences.txt. The latter is a new data file accompanying Version 5.0 of UTS #51,
  5244. "Unicode Emoji", whose emoji character repertoire corresponds to Unicode 10.0
  5245. [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>].</li>
  5246. <li>Corrections were made to the labels of several Mongolian standardized variation sequences,
  5247. but without changes to the actual character sequences.</li>
  5248. </ul>
  5249. </li>
  5250. <li>UnicodeData.txt
  5251. <ul>
  5252. <li>Entries were added for the 8,518 new characters, including letters, combining marks, digits, symbols, and punctuation marks.</li>
  5253. <li>The new characters include a total of 7,494 CJK unified ideographs, of which 21 were allocated at the end of
  5254. the CJK Unified Ideographs block, thus changing the last assigned code point in that block from U+9FD5 to U+9FEA.</li>
  5255. <li>The other 7,473 ideographs, allocated in the new block CJK Unified Ideographs Extension F, are in the range
  5256. U+2CEB0..U+2EBE0, written using the syntax for large ranges of characters with algorithmically derived names.</li>
  5257. <li>The newly encoded Nushu ideographs in the range U+1B170..U+1B2FB also have algorithmic names with the prefix
  5258. &quot;NUSHU CHARACTER-&quot;, but were listed individually.</li>
  5259. <li>Among the new nonspacing combining marks, there are 12 which have nonzero Canonical_Combining_Class values.</li>
  5260. <li>The three newly encoded Gujarati nukta characters, U+0AFD..U+0AFF, were assigned the Canonical_Combining_Class property
  5261. value 0 (Not_Reordered) rather than 7 (Nukta), although they were given the Indic_Syllabic_Category property value Nukta.
  5262. The Canonical_Combining_Class assignment is because those characters have specialized use and interact with other
  5263. Gujarati nonspacing marks used for transliteration of Arabic, added in Version 10.0.</li>
  5264. <li>The 11 new Malayalam Garshuni letters have the Bidi_Class property value Arabic_Letter, similar to the existing Syriac letters.</li>
  5265. <li>The new repertoire does not include any cased letters or any characters with nontrivial decomposition mappings.</li>
  5266. <li>There were also no changes in General_Category property values of existing characters in this version.</li>
  5267. </ul>
  5268. </li>
  5269. <li>VerticalOrientation.txt
  5270. <ul>
  5271. <li>This data file, which lists the Vertical_Orientation property values, was formally included in the UCD.</li>
  5272. <li>The newly encoded symbol U+2BD2 GROUP MARK was assigned the Vertical_Orientation property value Rotated.</li>
  5273. <li>All of the code points (assigned characters and unassigned code points) in the following new blocks
  5274. were assigned the Vertical_Orientation property value Upright: Kana Extended-A, Nushu, Soyombo, and Zanabazar Square.</li>
  5275. <li>Other newly encoded characters were assigned Vertical_Orientation property values that did not differ
  5276. from the prior defaults for their code points.</li>
  5277. </ul>
  5278. </li>
  5279. </ul>
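<p>The code point ranges quoted in the Core Data summary above can be cross-checked against the stated character counts, and the algorithmically derived names mentioned for UnicodeData.txt can be generated from the code point alone. The following is a minimal sketch under two assumptions: the derived name is the prefix followed by the code point in uppercase hexadecimal, and the CJK prefix is "CJK UNIFIED IDEOGRAPH-", as given by the name derivation rules of the Unicode Standard. The function names are hypothetical.</p>

```python
# Inclusive code point ranges quoted in the Unicode 10.0.0 change summary.
URO_ADDITIONS = (0x9FD6, 0x9FEA)       # additions to the CJK Unified Ideographs block
EXTENSION_F = (0x2CEB0, 0x2EBE0)       # CJK Unified Ideographs Extension F
NUSHU_IDEOGRAPHS = (0x1B170, 0x1B2FB)  # Nushu ideographs


def range_size(bounds):
    """Number of code points in an inclusive range."""
    first, last = bounds
    return last - first + 1


def in_range(code_point, bounds):
    return bounds[0] <= code_point <= bounds[1]


def derived_name(code_point):
    """Derive the character name for code points in the three ranges above."""
    if in_range(code_point, NUSHU_IDEOGRAPHS):
        return f"NUSHU CHARACTER-{code_point:04X}"
    if in_range(code_point, URO_ADDITIONS) or in_range(code_point, EXTENSION_F):
        return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"
    raise ValueError(f"U+{code_point:04X} is outside the ranges covered by this sketch")


new_cjk = range_size(URO_ADDITIONS) + range_size(EXTENSION_F)
print(range_size(URO_ADDITIONS), range_size(EXTENSION_F), new_cjk)
print(derived_name(0x1B170))
print(derived_name(0x2CEB0))
```

<p>The two CJK range sizes sum to the 7,494 new CJK unified ideographs cited in the summary, which together with the 1,024 other characters gives the total of 8,518.</p>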
  5280. <h4>Unihan Database (Unihan.zip)</h4>
  5281. <ul>
  5282. <li>Unihan_DictionaryIndices.txt
  5283. <ul>
  5284. <li>A few corrections were made in the dictionary index data for a small number of CJK unified ideographs.</li>
  5285. </ul>
  5286. </li>
  5287. <li>Unihan_DictionaryLikeData.txt
  5288. <ul>
  5289. <li><i>Cangjie</i> input codes were added for 40 of the characters in the URO extension area at the end of
  5290. the CJK Unified Ideographs block, and for one character from the CJK Unified Ideographs Extension B block.</li>
  5291. </ul>
  5292. </li>
  5293. <li>Unihan_IRGSources.txt
  5294. <ul>
  5295. <li>kIRG_USource and kRSUnicode field values were added for the 21 new characters encoded in the range U+9FD6..U+9FEA,
  5296. at the end of the CJK Unified Ideographs block.</li>
  5297. <li>IRG source data and kRSUnicode field values were added for the characters in the newly encoded
  5298. CJK Unified Ideographs Extension F block.</li>
  5299. <li>kIRG_GSource field values were added for the 12 unified ideographs in the CJK Compatibility Ideographs block,
  5300. and for a few compatibility ideographs in the same block.</li>
  5301. <li>A correction was made in the kRSUnicode field value of U+7353.</li>
  5302. </ul>
  5303. </li>
  5304. <li>Unihan_RadicalStrokeCounts.txt
  5305. <ul>
  5306. <li>A correction was made in the kRSKangXi field value of U+7353, in coordination with the similar kRSUnicode
  5307. correction in Unihan_IRGSources.txt.</li>
  5308. </ul>
  5309. </li>
  5310. <li>Unihan_Readings.txt
  5311. <ul>
  5312. <li>Many additions, corrections, and other updates were made in kHanyuPinyin and kDefinition field values.</li>
  5313. <li>Other additions and updates were made in a number of kCantonese field values, as well as a few kVietnamese and kMandarin field values.</li>
  5314. </ul>
  5315. </li>
  5316. <li>Unihan_Variants.txt
  5317. <ul>
  5318. <li>The trailing blanks were deleted from a few kSemanticVariant field values, with no other changes in actual data.</li>
  5319. </ul>
  5320. </li>
  5321. </ul>
  5322. <h4>Data for UAX #45</h4>
  5323. <ul>
  5324. <li>USourceData.txt
  5325. <ul>
  5326. <li>Six unencoded CJK ideographs were added as UTC-Source ideographs, with the identifiers UTC-02970 through UTC-02975.</li>
  5327. <li>Minor updates were made in the header, to reflect the encoding of the CJK Unified Ideographs Extension F block.</li>
  5328. </ul>
  5329. </li>
  5330. <li>USourceGlyphs.pdf
  5331. <ul>
  5332. <li>Glyphs were added for the six UTC-Source ideographs introduced in USourceData.txt.</li>
  5333. </ul>
  5334. </li>
  5335. </ul>
  5336. <h4>Conformance Test Data</h4>
  5337. <ul>
  5338. <li>NormalizationTest.txt
  5339. <ul>
  5340. <li>Test cases were added with sequences exercising the 12 newly encoded characters which are
  5341. nonspacing combining marks with nonzero Canonical_Combining_Class property values.</li>
  5342. </ul>
  5343. </li>
  5344. </ul>
  5345. <h4>Auxiliary Data for UAX #14 and UAX #29</h4>
  5346. <ul>
  5347. <li>GraphemeBreakProperty.txt
  5348. <ul>
  5349. <li>Entries were added for the newly encoded characters that were assigned the Grapheme_Cluster_Break property values
  5350. Extend, Prepend, and SpacingMark, according to the derivation expressions of those property values.</li>
  5351. <li>Emoji symbols, both existing and newly encoded, were assigned the Grapheme_Cluster_Break property values
  5352. E_Base and Glue_After_Zwj, based on their classification in Version 5.0 of UTS #51, &quot;Unicode Emoji&quot;
  5353. [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>].</li>
  5354. <li>In particular, the set of Glue_After_Zwj characters includes the old symbols U+2640 FEMALE SIGN,
  5355. U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS, which have the UTS #51 binary property Emoji=Yes.
  5356. (They were assigned that property in UTR #51 Version 4.0.)</li>
  5357. </ul>
  5358. </li>
  5359. <li>GraphemeBreakTest.txt
  5360. <ul>
  5361. <li>The instances of U+2764 HEAVY BLACK HEART in existing test cases were replaced by U+2640 FEMALE SIGN,
  5362. as the sample character for the Grapheme_Cluster_Break class Glue_After_Zwj.</li>
  5363. </ul>
  5364. </li>
  5365. <li>LineBreakTest.txt
  5366. <ul>
  5367. <li>Minor edits were made to the documentation, for clarity and as a result of the removal of the pair table from
  5368. UAX #14, &quot;Unicode Line Breaking Algorithm&quot; [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>].</li>
  5369. </ul>
  5370. </li>
  5371. <li>SentenceBreakProperty.txt
  5372. <ul>
  5373. <li>Entries were added for the newly encoded characters that were assigned the Sentence_Break property values
  5374. Extend, Numeric, OLetter, and STerm, according to the derivation expressions of those property values.</li>
  5375. </ul>
  5376. </li>
  5377. <li>WordBreakProperty.txt
  5378. <ul>
  5379. <li>Entries were added for the newly encoded characters that were assigned the Word_Break property values
  5380. ALetter, Extend, and Numeric, according to the derivation expressions of those property values.</li>
  5381. <li>A set of 34 phonetic modifiers with the General_Category property value Modifier_Symbol were assigned
  5382. the Word_Break property value ALetter. The value was assigned according to the revised definition of ALetter
  5383. in UAX #29, &quot;Unicode Text Segmentation&quot;, by direct assignment, without changing the Alphabetic
  5384. or General_Category properties of the affected characters [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</li>
  5385. <li>The Word_Break property value of U+02D7 MODIFIER LETTER MINUS SIGN was changed from MidLetter to ALetter,
  5386. as part of the same reclassification of phonetic modifiers.</li>
  5387. <li>Emoji symbols, both existing and newly encoded, were assigned the Word_Break property values
  5388. E_Base and Glue_After_Zwj, based on their classification in Version 5.0 of UTS #51, &quot;Unicode Emoji&quot;
  5389. [<a href="../tr41/tr41-21.html#UTS51">UTS51</a>].</li>
  5390. <li>In particular, the set of Glue_After_Zwj characters includes the old symbols U+2640 FEMALE SIGN,
  5391. U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS, which have the UTS #51 binary property Emoji=Yes.
  5392. (They were assigned that property in UTR #51 Version 4.0.)</li>
  5393. </ul>
  5394. </li>
  5395. <li>WordBreakTest.txt
  5396. <ul>
  5397. <li>The instances of U+2764 HEAVY BLACK HEART in existing test cases were replaced by U+2640 FEMALE SIGN,
  5398. as the sample character for the Word_Break class Glue_After_Zwj.</li>
  5399. </ul>
  5400. </li>
  5401. </ul>
  5402. <h4>Documentation for Auxiliary Data</h4>
  5403. <ul>
  5404. <li>GraphemeBreakTest.html
  5405. <ul>
  5406. <li>The instances of U+2764 HEAVY BLACK HEART in test cases and chart tooltips were replaced by
  5407. U+2640 FEMALE SIGN, as the sample character for the Grapheme_Cluster_Break class Glue_After_Zwj.</li>
  5408. <li>Editorial updates were made to the documentation contained in the file.</li>
  5409. </ul>
  5410. </li>
  5411. <li>LineBreakTest.html
  5412. <ul>
  5413. <li>Minor edits were made to the documentation, for clarity and as a result of the removal of the pair table from
  5414. UAX #14, &quot;Unicode Line Breaking Algorithm&quot; [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>].</li>
  5415. </ul>
  5416. </li>
  5417. <li>SentenceBreakTest.html
  5418. <ul>
  5419. <li>Editorial updates were made to the documentation contained in the file.</li>
  5420. </ul>
  5421. </li>
  5422. <li>WordBreakTest.html
  5423. <ul>
  5424. <li>The instances of U+2764 HEAVY BLACK HEART in test cases and chart tooltips were replaced by
  5425. U+2640 FEMALE SIGN, as the sample character for the Word_Break class Glue_After_Zwj.</li>
  5426. <li>Editorial updates were made to the documentation contained in the file.</li>
  5427. </ul>
  5428. </li>
  5429. </ul>
  5430. <hr>
  5431. <h3><a name="Unicode_9.0.0" href="http://www.unicode.org/versions/components-9.0.0.html">Unicode 9.0.0</a></h3>
  5432. <p><b>Changes in specific files:</b></p>
  5433. <p>Appropriate data files were updated to add the 7,500 new characters encoded in Unicode 9.0,
  5434. which consist of 6,881 Tangut characters and 619 other characters.
  5435. Major changes that are most likely to affect implementations are documented
  5436. in <a href="http://www.unicode.org/versions/Unicode9.0.0/#Migration">Section M of the Unicode 9.0.0 page</a>.
  5437. Detailed data file updates resulting from encoding the new characters and from various character
  5438. property changes are summarized below, in the same grouping manner used in
  5439. <a href="http://www.unicode.org/versions/components-9.0.0.html">Components of Unicode 9.0.0</a>.</p>
  5440. <p>Note that minor editorial updates and changes to the derived and extracted data files are not documented here.</p>
  5441. <p>Also note that citations of UTR #51, "Unicode Emoji" in this section refer to UTR #51 prior to
  5442. Version 5.0 [<a href="../tr41/tr41-21.html#UTR51">UTR51</a>].</p>
  5443. <h4>Core Data</h4>
  5444. <ul>
  5445. <li>ArabicShaping.txt
  5446. <ul>
  5447. <li>Entries were added for the newly encoded Arabic letters, as well as the new prefixed format control U+08E2.
  5448. These include three letters used for Warsh orthography, U+08BB..U+08BD, which define their own new joining groups,
  5449. AFRICAN FEH, AFRICAN QAF, and AFRICAN NOON.</li>
  5450. <li>Entries were added for the letters of the newly encoded Adlam script, all of which are dual joining.</li>
  5451. <li>U+202F NARROW NO-BREAK SPACE was explicitly listed for emphasis, because it influences shaping in Mongolian,
  5452. without having changed its joining properties.</li>
  5453. <li>The Joining_Type property value of the Mongolian baluda characters, U+1885 and U+1886, changed to Transparent
  5454. as a result of their reclassification as General_Category=Mn.</li>
  5455. </ul>
  5456. </li>
  5457. <li>Blocks.txt
  5458. <ul>
  5459. <li>A total of 11 new blocks were added, including blocks for the six new scripts and supplemental blocks for three existing scripts, Cyrillic, Glagolitic, and Mongolian.</li>
  5460. <li>The largest script by far in Unicode 9.0, Tangut, spans two dedicated blocks and one character from another new block, Ideographic Symbols and Punctuation.</li>
  5461. </ul>
  5462. </li>
  5463. <li>EastAsianWidth.txt
  5464. <ul>
  5465. <li>The pictographic symbols which have the Emoji_Presentation property as of <a href="../tr51/tr51-7.html">Version 3.0</a>
  5466. of Unicode Technical Report #51, "Unicode Emoji", with the exception of regional indicators, U+1F1E6..U+1F1FF,
  5467. were assigned the East_Asian_Width property value Wide [<a href="../tr41/tr41-21.html#UTR51">UTR51</a>].
  5468. This assignment includes both existing and newly encoded symbols, and ensures consistent treatment of emoji as Wide characters.</li>
  5469. <li>All of the Tangut characters&#x2014;ideographs, components, and the iteration mark U+16FE0&#x2014;were assigned the East_Asian_Width property value Wide.</li>
  5470. <li>Most of the other new characters were assigned the East_Asian_Width property value Neutral.</li>
  5471. </ul>
  5472. </li>
  5473. <li>IndicPositionalCategory.txt
  5474. <ul>
  5475. <li>Entries were added for the matras and non-vocalic marks of the three Brahmi-derived scripts introduced in Unicode 9.0&#x2014;Bhaiksuki, Marchen, and Newa.</li>
  5476. <li>A newly encoded combining mark used with Newa, U+1DFB, was specifically given an Indic_Positional_Category property value.</li>
  5477. <li>Two new marks added to Khojki and Saurashtra were also given Indic_Positional_Category property values.</li>
  5478. </ul>
  5479. </li>
  5480. <li>IndicSyllabicCategory.txt
  5481. <ul>
  5482. <li>Characters in the three newly encoded Brahmi-derived scripts, as well as new characters of existing Indic scripts,
  5483. including Malayalam chillus and Khojki and Saurashtra marks, were added with appropriate property values.</li>
  5484. <li>The rule used to derive the set of characters with the Indic_Syllabic_Category property value Nukta was updated
  5485. to exclude U+1E94A ADLAM NUKTA, as Adlam is not a Brahmi-derived script.</li>
  5486. <li>A few previously encoded Khmer and Myanmar characters, such as the Khamti Shan logograms U+AA74..U+AA76,
  5487. were also assigned specific Indic_Syllabic_Category property values.</li>
  5488. </ul>
  5489. </li>
  5490. <li>LineBreak.txt
  5491. <ul>
  5492. <li>Three new Line_Break property values were introduced, in conjunction with algorithm rules,
  5493. to ensure that the various types of character sequences that represent emoji are handled as indivisible units in line breaking
  5494. [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>, <a href="../tr41/tr41-21.html#UTR51">UTR51</a>].</li>
  5495. <li>Two of the new property values were assigned to characters based on the classification of emoji characters in UTR #51:
  5496. Line_Break=E_Base to the symbols with the UTR #51 binary property Emoji_Modifier_Base,
  5497. and Line_Break=E_Modifier to the characters with the UTR #51 binary property Emoji_Modifier, which consists of the range U+1F3FB..U+1F3FF.
  5498. The affected characters are both existing and new in Unicode 9.0.
  5499. The existing characters that became Line_Break=E_Base had all been Line_Break=Ideographic,
  5500. and the five characters that became Line_Break=E_Modifier had all been Line_Break=Alphabetic.</li>
  5501. <li>The Line_Break property value of U+200D ZERO WIDTH JOINER changed from Combining_Mark to ZWJ,
  5502. the third new Line_Break property value, assigned solely to U+200D.</li>
  5503. <li>For forward compatibility, all of the unassigned code points in the range U+1F000..U+1FFFD,
  5504. whether inside or outside of allocated blocks, were given the default Line_Break property value Ideographic.
  5505. These default values allow better interoperability between applications that support emoji as of different versions of Unicode.</li>
  5506. <li>The Line_Break property values of the halfwidth Katakana and Hangul jamo variants
  5507. in the Halfwidth and Fullwidth Forms block changed from Alphabetic to Ideographic,
  5508. to match the established line breaking behavior of those characters in existing implementations.</li>
  5509. <li>The Line_Break property value of the Mongolian baluda characters, U+1885 and U+1886,
  5510. changed from Alphabetic to Combining_Mark as a result of their reclassification as General_Category=Mn.</li>
  5511. <li>The Line_Break property value of U+2764 HEAVY BLACK HEART changed from Alphabetic to Ideographic,
  5512. as a result of its addition to the set of characters with the UTR #51 binary property Emoji.</li>
  5513. <li>Newly encoded characters were assigned appropriate Line_Break property values.</li>
  5514. </ul>
  5515. </li>
  5516. <li>NamedSequences.txt
  5517. <ul>
  5518. <li>Comment lines were spliced in, documenting the named character sequences that had been included
  5519. in the original set of sequences published in Unicode 4.1.</li>
  5520. </ul>
  5521. </li>
  5522. <li>NamedSequencesProv.txt
  5523. <ul>
  5524. <li>The set of 12 named sequences that represent keycaps, used for emoji, remained provisional
  5525. and were modified to include an explicit emoji variation selector U+FE0F in each sequence.
  5526. The insertion was made in accordance with UTR #51, which states that emoji variation selectors
  5527. are used to control the presentation style of emoji characters that have a default text presentation.</li>
  5528. </ul>
  5529. </li>
  5530. <li>NamesList.txt
  5531. <ul>
  5532. <li>Content was updated throughout with new characters, as well as annotations, cross references, subheadings, and remarks.</li>
  5533. </ul>
  5534. </li>
  5535. <li>PropertyAliases.txt
  5536. <ul>
  5537. <li>The long name alias of the binary property STerm was redefined to Sentence_Terminal,
  5538. for name clarity and disambiguation from the Sentence_Break property value STerm.
  5539. Because the short and long name aliases of the binary property had been identical,
  5540. the redefinition of the long alias is equivalent to the introduction of an additional alias.</li>
  5541. <li>An entry was added for the newly defined binary property, Prepended_Concatenation_Mark, abbreviated PCM.</li>
  5542. </ul>
  5543. </li>
  5544. <li>PropertyValueAliases.txt
  5545. <ul>
  5546. <li>The 9.0 value was added to the catalog property Age.</li>
  5547. <li>Script and Block property values were added for the six new scripts and 11 new blocks introduced.</li>
  5548. <li>Entries were added for the new Line_Break, Grapheme_Cluster_Break, and Word_Break property values
  5549. introduced in the corresponding line breaking and text segmentation algorithms for handling emoji sequences.</li>
  5550. <li>Entries were added for the three new Joining_Group property values introduced with the Arabic letters
  5551. U+08BB..U+08BD, used for Warsh orthography.</li>
  5552. <li>A new section was added for the values of the newly defined binary property Prepended_Concatenation_Mark.</li>
  5553. <li>The comment line marking the section for the binary property STerm was updated with the new long property name alias Sentence_Terminal.</li>
  5554. </ul>
  5555. </li>
  5556. <li>PropList.txt
  5557. <ul>
  5558. <li>Most of the newly encoded combining marks were assigned either the contributory property Other_Alphabetic
  5559. or the binary property Diacritic, as appropriate.</li>
  5560. <li>Newly encoded punctuation characters that mark the end of various sections of text, such as dandas,
  5561. were assigned the appropriate binary properties Terminal_Punctuation or Sentence_Terminal,
  5562. with the latter using the new long name alias instead of STerm.</li>
  5563. <li>The Mongolian baluda characters U+1885..U+1886, which were reclassified from General_Category=Lo to Mn,
  5564. were assigned the contributory properties Other_Alphabetic and Other_ID_Start.
  5565. These assignments were made to preserve the Alphabetic and ID_Start properties of the two characters.
  5566. In particular, the preservation of the ID_Start property is dictated by the stability guarantees for Unicode identifiers.</li>
  5567. <li>The newly encoded Tangut ideographs and components were assigned the Ideographic property (but not the Unified_Ideograph property).</li>
  5568. <li>The Tangut iteration mark U+16FE0 and a few Adlam combining marks were assigned the binary property Extender.</li>
  5569. <li>The stateful tag terminator U+E007F CANCEL TAG, formerly deprecated, was made non-deprecated again, for use in emoji contexts.</li>
  5570. <li>A section was added for the set of characters with the newly defined binary property Prepended_Concatenation_Mark.
  5571. The characters with this property, such as U+0600 ARABIC NUMBER SIGN, are also referred to as
  5572. prefixed format control characters or loosely as subtending marks.</li>
  5573. <li>The contributory property Other_Grapheme_Extend was assigned to the tag characters U+E0020..U+E007F
  5574. and was removed for U+200D ZERO WIDTH JOINER (ZWJ). These changes were made to preserve equality between
  5575. the sets of characters with the property values Grapheme_Cluster_Break=Extend and Grapheme_Extend=Y,
  5576. after the addition of tag characters to, and the removal of ZWJ from, the former set.</li>
  5577. </ul>
  5578. </li>
  5579. <li>Scripts.txt
  5580. <ul>
  5581. <li>The new characters were assigned appropriate Script property values, including six new values for
  5582. the newly encoded scripts: Adlam, Bhaiksuki, Marchen, Newa, Osage, and Tangut.</li>
  5583. <li>The newly encoded emoji were assigned the Script property value Common, in a manner consistent with
  5584. similar characters encoded previously.</li>
  5585. <li>There were no changes of Script property values for any existing characters.</li>
  5586. </ul>
  5587. </li>
  5588. <li>ScriptExtensions.txt
  5589. <ul>
  5590. <li>The Script_Extensions property values of over 200 ideographic symbols, which used to contain multiple Script values
  5591. such as Bopomofo, Hangul, Hiragana, Katakana, as well as Han, were reduced to single-script set values, Script_Extensions={Han}.
  5592. See the resolution of <a href="http://www.unicode.org/review/pri316/">PRI&nbsp;#316</a>.</li>
  5593. <li>As Adlam can use U+0640 ARABIC TATWEEL in the cursive form of the script to graphically extend words,
  5594. the Script_Extensions property value of U+0640 was updated to include the Script value Adlam.</li>
  5595. <li>The Script value Kannada was added to the Script_Extensions property values of the North Indic
  5596. fraction signs U+A830..U+A835, attested in Kannada texts.</li>
  5597. <li>The Script_Extensions property values of the Aegean numeral symbols U+10107..U+10133 were updated to include the Script value Linear_A.</li>
  5598. <li>The Script_Extensions property values of other characters used in multiple scripts were updated accordingly.</li>
  5599. </ul>
  5600. </li>
  5601. <li>StandardizedVariants.txt
  5602. <ul>
  5603. <li>A total of 278 emoji variation sequences were added to complete the set of text and emoji presentations
  5604. for all pictographic symbols identified as having a default text presentation [<a href="../tr41/tr41-21.html#UTR51">UTR51</a>].</li>
  5605. <li>Standardized variation sequences were added to complete the set of dotted forms of Myanmar letters for
  5606. Khamti, Aiton, and Phake, to distinguish them from the Burmese and Shan styles. One of the sequences has
  5607. a spacing combining mark as the initial character of the sequence: &lt;U+1031, U+FE00&gt;.</li>
  5608. <li>A standardized variation sequence was added for the slashed-zero form of the empty set symbol, U+2205.
  5609. A separate standardized variation sequence was added for the form with short diagonal stroke of digit 0,
  5610. U+0030, to avoid misuse of the previous sequence for the variant form of the digit.</li>
  5611. </ul>
  5612. </li>
  5613. <li>TangutSources.txt
  5614. <ul>
  5615. <li>This new data file was added to the UCD. It contains source mappings for Tangut ideographs and components,
  5616. as well as radical-stroke data for the ideographs, in the same format as the Unihan data files.</li>
  5617. </ul>
  5618. </li>
  5619. <li>UnicodeData.txt
  5620. <ul>
  5621. <li>Entries were added for the newly encoded characters, including case pairs and cased letters which form
  5622. case pairs with previously encoded letters.</li>
  5623. <li>The additions include 9 historic Cyrillic letters, U+1C80..U+1C88, which have asymmetric case mappings
  5624. to existing uppercase letters, similar to the asymmetric case mapping of Greek final sigma to capital sigma.</li>
  5625. <li>The additions also include a range of Tangut ideographs, U+17000..U+187EC, which uses the same syntax
  5626. as that for large ranges of characters with algorithmically derived names. For Tangut ideographs, the
  5627. derived names are TANGUT IDEOGRAPH-17000 through TANGUT IDEOGRAPH-187EC.</li>
  5628. <li>Among the new nonspacing combining marks, there are 63 which have nonzero Canonical_Combining_Class values.</li>
  5629. <li>One new character, U+1F23B SQUARED CJK UNIFIED IDEOGRAPH-914D, has a nontrivial compatibility decomposition mapping.</li>
  5630. <li>The Mongolian baluda characters, U+1885 and U+1886, were reclassified as General_Category=Mn,
  5631. and their Bidi_Class property was updated to Nonspacing_Mark, accordingly.</li>
  5632. </ul>
  5633. </li>
  5634. </ul>
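<p>The algorithmic naming of Tangut ideographs noted under UnicodeData.txt can be illustrated with a minimal sketch. The function name and constants below are hypothetical and are not part of the UCD; the sketch only assumes the Unicode 9.0 range U+17000..U+187EC stated above and the derived name pattern TANGUT IDEOGRAPH-<i>hex</i>.</p>

```python
# Illustrative sketch only: the names below are assumptions, not UCD identifiers.
# The range is the Tangut ideograph range given for Unicode 9.0 in UnicodeData.txt.
TANGUT_FIRST = 0x17000
TANGUT_LAST = 0x187EC


def tangut_ideograph_name(code_point: int) -> str:
    """Return the derived character name for a Tangut ideograph code point.

    The name is the fixed prefix "TANGUT IDEOGRAPH-" followed by the code
    point in uppercase hexadecimal, for example TANGUT IDEOGRAPH-17000.
    """
    if not TANGUT_FIRST <= code_point <= TANGUT_LAST:
        raise ValueError(f"U+{code_point:04X} is outside the Tangut ideograph range")
    return f"TANGUT IDEOGRAPH-{code_point:04X}"


if __name__ == "__main__":
    print(tangut_ideograph_name(0x17000))
    print(tangut_ideograph_name(0x187EC))
```

<p>A lookup of this kind avoids storing one name per ideograph, which is the reason the range syntax is used for large sets of characters with algorithmically derived names.</p>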
  5635. <h4>Unihan Database (Unihan.zip)</h4>
  5636. <ul>
  5637. <li>Unihan_DictionaryIndices.txt
  5638. <ul>
  5639. <li>Dictionary index data was added for 196 ideographs from the CJK Unified Ideographs Extension E block,
  5640. for the first time since the encoding of Extension E in Unicode 8.0.</li>
  5641. </ul>
  5642. </li>
  5643. <li>Unihan_DictionaryLikeData.txt
  5644. <ul>
  5645. <li>The total stroke count values for the 5,771 CJK unified ideographs encoded in Unicode 8.0, which had been missing from Unihan,
  5646. were entirely populated: 9 ideographs in the main CJK Unified Ideographs block and the rest comprising all of the assigned characters
  5647. in the CJK Unified Ideographs Extension E block.</li>
  5648. <li>A few other total stroke count values were corrected, and one kCihaiT field value was added.</li>
  5649. </ul>
  5650. </li>
  5651. <li>Unihan_IRGSources.txt
  5652. <ul>
  5653. <li>A total of 2,828 kIRG_JSource fields were updated to use the latest source references from the Japanese Industrial Standard
  5654. JIS X 0213:2004, instead of the corresponding legacy references from JIS X 0212-1990 and from Unified Japanese IT Vendors Contemporary Ideographs.</li>
  5655. <li>The values of the residual stroke counts in the kRSUnicode fields of 20 CJK unified ideographs were changed from 0 to negative values.
  5656. A negative value indicates that strokes which would normally constitute the indexing radical are intentionally missing.</li>
  5657. <li>A few kIRG_GSource, kIRG_MSource, and kIRG_USource field values were added, and a couple of kIRG_GSource and kIRG_KSource field values were removed.</li>
  5658. <li>The kRSUnicode fields of a small number of other ideographs were also updated with corrections or additional values.</li>
  5659. </ul>
  5660. </li>
  5661. <li>Unihan_RadicalStrokeCounts.txt
  5662. <ul>
  5663. <li>The kRSKangXi fields of the same CJK unified ideographs whose kRSUnicode fields were changed in Unihan_IRGSources.txt
  5664. (with the exception of one Extension E ideograph, U+2C09B) were similarly changed in Unihan_RadicalStrokeCounts.txt.</li>
  5665. </ul>
  5666. </li>
  5667. <li>Unihan_Readings.txt
  5668. <ul>
  5669. <li>Over 600 kMandarin readings and over 100 kHanyuPinlu field values were updated.</li>
  5670. <li>A few kDefinition and kHangul fields were revised, and a couple of kMandarin and kCantonese readings were added.</li>
  5671. </ul>
  5672. </li>
  5673. <li>Unihan_Variants.txt
  5674. <ul>
  5675. <li>A small number of variant relationship mappings were added or updated.</li>
  5676. </ul>
  5677. </li>
  5678. </ul>
  5679. <h4>Data for UAX #45</h4>
  5680. <ul>
  5681. <li>USourceData.txt
  5682. <ul>
  5683. <li>A total of 1,768 unencoded CJK ideographs were added as U-Source ideographs, with the identifiers UTC-01202 through UTC-02968 and UCI-02969.</li>
  5684. </ul>
  5685. </li>
  5686. <li>USourceGlyphs.pdf
  5687. <ul>
  5688. <li>Glyphs were added for the 1,768 U-Source ideographs introduced in USourceData.txt.</li>
  5689. </ul>
  5690. </li>
  5691. </ul>
  5692. <h4>Conformance Test Data</h4>
  5693. <ul>
  5694. <li>BidiCharacterTest.txt
  5695. <ul>
  5696. <li>Tests were added covering the edge cases of the Unicode Bidirectional Algorithm,
  5697. which were subject to changes and clarifications made in Unicode 8.0, described in detail in
  5698. the background document of <a href="http://www.unicode.org/review/pri279/">PRI&nbsp;#279</a>.</li>
  5699. <li>A few other test cases were added for verifying the resolution of deeply nested bracket pairs,
  5700. at the boundary conditions when the number of nested pairs reaches and exceeds the fixed capacity of the bracket stack.</li>
  5701. </ul>
  5702. </li>
  5703. <li>NormalizationTest.txt
  5704. <ul>
  5705. <li>Test cases were added with sequences exercising all newly encoded characters which are nonspacing combining marks
  5706. with nonzero Canonical_Combining_Class values.</li>
  5707. <li>One test case was added with a sequence containing the single newly encoded character which has a
  5708. nontrivial compatibility decomposition mapping, U+1F23B SQUARED CJK UNIFIED IDEOGRAPH-914D.</li>
  5709. <li>Two extra test cases were added, consisting of character sequences with conjoining Hangul jamo and precomposed Hangul syllables.</li>
  5710. </ul>
  5711. </li>
  5712. </ul>
  5713. <h4>Auxiliary Data for UAX #14 and UAX #29</h4>
  5714. <ul>
  5715. <li>GraphemeBreakProperty.txt
  5716. <ul>
  5717. <li>The Grapheme_Cluster_Break class Prepend, previously empty, was populated with a set of characters,
  5718. according to its new derivation expression [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].
  5719. The set includes the characters with the newly defined binary property Prepended_Concatenation_Mark,
  5720. which used to be Grapheme_Cluster_Break=Control, as well as a few other characters with the
  5721. Indic_Syllabic_Category property values Consonant_Preceding_Repha and Consonant_Prefixed.</li>
  5722. <li>The newly encoded combining marks were assigned the Grapheme_Cluster_Break property values
  5723. Extend or SpacingMark, largely by derivation from their General_Category property values.</li>
  5724. <li>The Mongolian baluda characters, U+1885 and U+1886, became Grapheme_Cluster_Break=Extend also
  5725. by derivation, following their reclassification as General_Category=Mn.</li>
  5726. <li>The tag characters U+E0020..U+E007F, all of them non-deprecated as of Unicode 9.0, were moved from
  5727. the Grapheme_Cluster_Break class Control to Extend.</li>
  5728. <li>U+200D ZERO WIDTH JOINER, formerly Grapheme_Cluster_Break=Extend, formed a new class by itself,
  5729. Grapheme_Cluster_Break=ZWJ. The new property value is used in the Grapheme Cluster Boundary Algorithm
  5730. for the handling of emoji zwj sequences defined in UTR #51 as indivisible units [<a href="../tr41/tr41-21.html#UTR51">UTR51</a>].</li>
  5731. <li>The pictographic symbols with the UTR #51 binary property Emoji_Modifier_Base formed two newly defined
  5732. Grapheme_Cluster_Break classes, E_Base and E_Base_GAZ. The partitioning is determined by the additional
  5733. presence or absence of those characters in the set of emoji zwj sequences defined in UTR #51.</li>
  5734. <li>Other pictographic symbols that appear in emoji zwj sequences (after ZWJ) but do not have the UTR #51
  5735. binary property Emoji_Modifier_Base formed an additional new class, Grapheme_Cluster_Break=Glue_After_Zwj.</li>
  5736. <li>The characters with the UTR #51 binary property Emoji_Modifier formed the last emoji-related,
  5737. newly defined Grapheme_Cluster_Break class E_Modifier.</li>
  5738. </ul>
  5739. </li>
  5740. <li>GraphemeBreakTest.txt
  5741. <ul>
  5742. <li>Test cases were added exercising the newly populated Grapheme_Cluster_Break class Prepend.</li>
  5743. <li>Test cases were added exercising the newly defined emoji-related Grapheme_Cluster_Break property values
  5744. E_Base, E_Base_GAZ, Glue_After_Zwj, and E_Modifier, also in combinations with the newly factored out ZWJ.</li>
  5745. <li>Test cases were updated to illustrate grapheme cluster boundaries in sequences of regional indicator
  5746. characters, according to the revised Grapheme Cluster Boundary Algorithm: in sequences of more than two,
  5747. regional indicators are kept together in pairs.</li>
  5748. <li>The rule numbers reported in the test results were updated according to the revised Grapheme Cluster Boundary Algorithm.</li>
  5749. </ul>
  5750. </li>
  5751. <li>LineBreakTest.txt
  5752. <ul>
  5753. <li>Many test cases were added exercising the newly defined emoji-related Line_Break property values
  5754. E_Base and E_Modifier, as well as ZWJ.</li>
  5755. <li>The expected test results were updated according to the revised rules of the Unicode Line Breaking Algorithm.</li>
  5756. </ul>
  5757. </li>
  5758. <li>SentenceBreakProperty.txt
  5759. <ul>
  5760. <li>Newly encoded characters were assigned the Sentence_Break property values Extend, Format, Lower,
  5761. Numeric, OLetter, STerm, or Upper, by derivation from their primary property values.</li>
  5762. <li>The Sentence_Break property values of the Mongolian baluda characters, U+1885 and U+1886, changed from
  5763. OLetter to Extend also by derivation, following their reclassification as General_Category=Mn.</li>
  5764. <li>The tag characters U+E0020..U+E007F were moved from the Sentence_Break class Format to Extend.</li>
  5765. </ul>
  5766. </li>
  5767. <li>SentenceBreakTest.txt
  5768. <ul>
  5769. <li>The rule numbers reported in the test results were updated, to reflect the renumbering of one rule
  5770. of the Sentence Boundary Algorithm.</li>
  5771. <li>A few test cases were added, removed, or reordered.</li>
  5772. </ul>
  5773. </li>
  5774. <li>WordBreakProperty.txt
  5775. <ul>
  5776. <li>Newly encoded characters were assigned the Word_Break property values ALetter, Extend, Format, or Numeric,
  5777. by derivation from other property values.</li>
  5778. <li>The Word_Break property values of the Mongolian baluda characters, U+1885 and U+1886, changed from
  5779. ALetter to Extend also by derivation, following their reclassification as General_Category=Mn.</li>
  5780. <li>The tag characters U+E0020..U+E007F were moved from the Word_Break class Format to Extend.</li>
  5781. <li>The newly introduced Word_Break property values related to emoji&#x2014;E_Base, E_Base_GAZ, Glue_After_Zwj, and
  5782. E_Modifier&#x2014;were assigned to the same sets of pictographic symbols as the similarly named Grapheme_Cluster_Break property values.</li>
  5783. <li>U+200D ZERO WIDTH JOINER formed a new class by itself, Word_Break=ZWJ, also similar to the Grapheme_Cluster_Break reclassification of U+200D.</li>
  5784. <li>The Word_Break property value of U+202F NARROW NO-BREAK SPACE changed from the default Other to ExtendNumLet.</li>
  5785. </ul>
  5786. </li>
  5787. <li>WordBreakTest.txt
  5788. <ul>
  5789. <li>Test cases were added exercising the newly defined emoji-related Word_Break property values
  5790. E_Base, E_Base_GAZ, Glue_After_Zwj, and E_Modifier, also in combinations with the newly factored out ZWJ.</li>
  5791. <li>Test cases were updated to illustrate word boundaries in sequences of regional indicator characters,
  5792. according to the revised Word Boundary Algorithm.</li>
  5793. </ul>
  5794. </li>
  5795. </ul>
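<p>The pairing of regional indicators mentioned under GraphemeBreakTest.txt can be sketched as follows. This is a deliberately simplified illustration, not the Grapheme Cluster Boundary Algorithm of UAX #29: it handles only regional indicators in the range U+1F1E6..U+1F1FF and treats every other character as a cluster of its own. All function names are hypothetical.</p>

```python
# Simplified illustration: only the regional indicator pairing is modeled.
# Every other character is treated as its own cluster, which the full
# algorithm in UAX #29 does not do.
RI_FIRST = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
RI_LAST = 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(ch: str) -> bool:
    """Return True if ch is a single regional indicator character."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_keeping_ri_pairs(text: str) -> list[str]:
    """Split text so that consecutive regional indicators stay together in pairs."""
    clusters = []
    i = 0
    while i < len(text):
        next_is_ri = i + 1 < len(text) and is_regional_indicator(text[i + 1])
        if is_regional_indicator(text[i]) and next_is_ri:
            clusters.append(text[i:i + 2])  # keep the pair as one unit
            i += 2
        else:
            clusters.append(text[i])  # unpaired indicator or any other character
            i += 1
    return clusters


if __name__ == "__main__":
    four = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"
    print(len(split_keeping_ri_pairs(four)))
```

<p>With four consecutive regional indicators the sketch yields two clusters of two characters each; with three, the last indicator is left unpaired.</p>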
  5796. <h4>Documentation for Auxiliary Data</h4>
  5797. <ul>
  5798. <li>GraphemeBreakTest.html
  5799. <ul>
  5800. <li>The pair table was updated to include the five newly defined Grapheme_Cluster_Break property values&#x2014;E_Base,
  5801. E_Base_GAZ, Glue_After_Zwj, E_Modifier, and ZWJ&#x2014;as well as the existing but now populated class Prepend.</li>
  5802. <li>The test rules were updated to match those in the Grapheme Cluster Boundary Algorithm as defined in
  5803. Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-21.html#UAX29">UAX29</a>].</li>
  5804. <li>The sample test cases were updated and a few more were added.</li>
  5805. </ul>
  5806. </li>
  5807. <li>LineBreakTest.html
  5808. <ul>
  5809. <li>The pair table was updated to include the three newly defined Line_Break property values&#x2014;E_Base, E_Modifier, and ZWJ.</li>
  5810. <li>The test rules were updated as a result of the changes in the Unicode Line Breaking Algorithm [<a href="../tr41/tr41-21.html#UAX14">UAX14</a>].</li>
  5811. <li>Several sample test cases were added.</li>
  5812. </ul>
  5813. </li>
  5814. <li>SentenceBreakTest.html
  5815. <ul>
  5816. <li>One rule of the Sentence Boundary Algorithm was renumbered, reflecting the same change made in UAX #29.</li>
  5817. <li>A few sample test cases were added, removed, or reordered.</li>
  5818. </ul>
  5819. </li>
  5820. <li>WordBreakTest.html
  5821. <ul>
  5822. <li>The pair table, test rules, and sample test cases were updated in a manner similar to the corresponding updates made in GraphemeBreakTest.html.</li>
  5823. </ul>
  5824. </li>
  5825. </ul>
  5826. <hr>
  5827. <h2 class="nonumber"><a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a></h2>
  5828. <p>Mark Davis and Ken Whistler are the authors of the initial version and have added to and
  5829. maintained the text of this annex. Laurențiu Iancu assisted
  5830. in the documentation of UCD changes for Versions 6.3.0 through 10.0.0. Julie Allen and Asmus Freytag provided editorial
  5831. suggestions for improvement of the text. Over the years, many
  5832. members of the UTC have participated in the review of the UCD
  5833. and its documentation.</p>
  5834. <h2 class="nonumber"><a name="References" href="#References">References</a></h2>
  5835. <p>For references for this annex, see Unicode Standard Annex #41, "<a href="../tr41/tr41-21.html">Common
  5836. References for Unicode Standard Annexes</a>."</p>
  5837. <h2 class="nonumber"><a name="Modifications" href="#Modifications">Modifications</a></h2>
  5838. <p>The following summarizes modifications from previous revisions of this
  5839. annex.</p>
  5840. <h3>Revision 20 [KW, LI]</h3>
  5841. <ul>
  5842. <li><b>Reissued</b> for Unicode 10.0.0.</li>
  5843. <li>Removed old UCD Change History entry for Unicode 8.0.0, and added new one for Unicode 10.0.0.</li>
  5844. <li>Updated the description of the <a href="#Name">Name</a> property value.</li>
  5845. <li>Updated the discussion of immutable properties and the list of those properties in
  5846. <a href="#Immutable_Properties_Table">Table 19</a>.</li>
  5847. <li>Added a new Table 10a, <a href="#Contributory_Properties_Table">Contributory Properties</a>
  5848. in Section 5.5.</li>
  5849. <li>Added a row to Table 5, <a href="#UCD_Files_Table">Files in the UCD</a> for
  5850. NushuSources.txt. Tweaked content elsewhere to account for this new addition.</li>
  5851. <li>Added new Section 5.13 <a href="#Property_APIs">Property APIs</a>.</li>
  5852. <li>Updated Table 9, <a href="#Property_List_Table">Property Table</a> to show
  5853. that the <a href="#Ideographic">Ideographic</a> property, rather than the
  5854. Unified_Ideograph property, is now used in the definition of Ideographic Description
  5855. Sequences.</li>
  5856. <li>Added entries for the <a href="#Vertical_Orientation">Vertical_Orientation</a>
  5857. and <a href="#Regional_Indicator">Regional_Indicator</a> properties
  5858. in Table 9, <a href="#Property_List_Table">Property Table</a>.</li>
  5859. <li>Adjusted the discussion of the <a href="#Block">Block</a> property in
  5860. Table 9, <a href="#Property_List_Table">Property Table</a>.</li>
  5861. <li>Added default value for the <a href="#Vertical_Orientation">Vertical_Orientation</a> property
  5862. in Table 4, <a href="#Default_Values_Table">Default Values for Properties</a>
  5863. and added an indication that the default values for Vertical_Orientation are complex.</li>
  5864. <li>Added discussion of new data file DerivedName.txt to Section 5.4,
  5865. <a href="#Derived_Extracted">Derived Extracted Properties</a>.</li>
  5866. <li>Added new Section 2.1.3, <a href="#Props_External">Properties Dependent on External
  5867. Specifications</a> to discuss the dependency of UCD segmentation properties on the
  5868. non-UCD emoji properties.</li>
  5869. <li>Added new Section 5.14, <a href="#Character_Age">Character Age</a> to further explain
  5870. the details of the Age property and its derivation.</li>
  5871. <li>Added column indicating which default values are complex in
  5872. Table 4, <a href="#Default_Values_Table">Default Values for Properties</a>.</li>
  5873. <li>Updated various mentions of "U-Source ideographs" to "UTC-Source ideographs".</li>
  5874. </ul>
  5875. <p>Revision 19 being a proposed update, only changes between revisions 20 and
  5876. 18 are noted here.</p>
  5877. <h3>Revision 18 [KW, LI]</h3>
  5878. <ul>
  5879. <li><b>Reissued</b> for Unicode 9.0.0.</li>
  5880. <li>Removed old UCD Change History entry for Unicode 7.0.0, and added new one for Unicode 9.0.0.</li>
  5881. <li>Updated Section 3.4 <a href="#StandardizedVariants">StandardizedVariants.html</a> to
  5882. document the obsolescence of that file and the alternative means now available for
  5883. displaying reference glyphs for standardized variants.</li>
  5884. <li>Added new Section 3.5 <a href="#EmojiVariants">Emoji Variation Sequences</a> to
  5885. document the page on the emoji subsite showing the glyphs for the emoji variation sequences.</li>
  5886. <li>Updated documentation for <a href="#STerm">Sentence_Terminal</a> to use the long alias.</li>
  5887. <li>Updated documentation for <a href="#Ideographic">Ideographic</a> and
  5888. <a href="#Unified_Ideograph">Unified_Ideograph</a> to clarify their relationship.</li>
  5889. <li>Added a row to Table 5, <a href="#UCD_Files_Table">Files in the UCD</a> for
  5890. TangutSources.txt. Adjusted content elsewhere to account for this addition.</li>
  5891. <li>Added clarification in Section 5.7.5
  5892. <a href="#Decompositions_and_Normalization">Decompositions and Normalization</a>
  5893. regarding which normalization-related properties should or should not be exported
  5894. in an API.</li>
  5895. <li>Added note in Section 5.12 <a href="#Deprecation">Deprecation</a>
  5896. indicating that deprecated properties are not recommended for support in APIs.</li>
  5897. <li>Added documentation for <a href="#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a>.</li>
  5898. <li>Updated statement about default values for the Line_Break property in
  5899. Section 4.2.9 <a href="#Default_Values">Default Values</a>.</li>
  5900. </ul>
  5901. <p>Revision 17 being a proposed update, only changes between revisions 18 and
  5902. 16 are noted here.</p>
  5903. <h3>Revision 16 [KW, LI]</h3>
  5904. <ul>
  5905. <li><b>Reissued</b> for Unicode 8.0.0.</li>
  5906. <li>Removed old UCD Change History entry for Unicode 6.3.0, and added new one for Unicode 8.0.0.</li>
  5907. <li>Clarified the intent for the information contained in <a href="#Property_List_Table">Table 9</a>
  5908. in Section 5.3 <a href="#Property_Definitions">Property Definitions</a>.</li>
  5909. <li>Updated table styles.</li>
  5910. <li>Renamed Indic_Matra_Category to <a href="#Indic_Positional_Category">Indic_Positional_Category</a>, with corresponding change in the file name.</li>
  5911. <li>Changed <a href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a> and the renamed
  5912. <a href="#Indic_Positional_Category">Indic_Positional_Category</a> from Provisional to Informative status.</li>
  5913. <li>Added information about location of UCD.zip and the URL for zipped/latest.</li>
  5914. </ul>
  5915. <p>Revision 15 being a proposed update, only changes between revisions 16 and
  5916. 14 are noted here.</p>
  5917. <h3>Revision 14 [KW, LI]</h3>
  5918. <ul>
  5919. <li><b>Reissued</b> for Unicode 7.0.0.</li>
  5920. <li>Removed old UCD Change History entry for Unicode 6.2.0, and added new one for Unicode 7.0.0.</li>
  5921. <li>Updated chapter references for Unicode 7.0.0.</li>
  5922. <li>Updated the derivation of the <a href="#Alphabetic">Alphabetic</a> property.</li>
  5923. <li>Updated the derivation of the <a href="#Case_Ignorable">Case_Ignorable</a> property.</li>
  5924. <li>Simplified the discussion of @missing in Section 4.2.10 <a href="#Missing_Conventions">@missing Conventions</a>,
  5925. to reflect the revised conventions in the UCD data files, which eliminated special edge cases.</li>
  5926. <li>Corrected statement about aliases for provisional properties in Section 5.8
  5927. <a href="#Property_Aliases">Property and Property Value Aliases</a>.</li>
  5928. <li>Minor editing.</li>
  5929. </ul>
  5930. <p>Revision 13 being a proposed update, only changes between revisions 14 and
  5931. 12 are noted here.</p>
  5932. <h3>Revision 12 [KW, LI]</h3>
  5933. <ul>
  5934. <li><b>Reissued</b> for Unicode 6.3.0.</li>
  5935. <li>Removed old UCD Change History entry for Unicode 6.1.0, and added new one for Unicode 6.3.0.</li>
  5936. <li>Added a clarification about <a href="#Numeric_Type">Numeric_Type</a>=Digit.</li>
  5937. <li>Added documentation of default values for Line_Break, added additional default values
  5938. for Bidi_Class, and clarified the usage of @missing in Section 4.2.9 <a href="#Default_Values">Default Values</a>.</li>
  5939. <li>Added new Section 4.2.10 <a href="#Missing_Conventions">@missing Conventions</a>, to spell out
  5940. syntax and other issues for @missing lines in more detail.</li>
  5941. <li>Clarified the status of default values in Section 5.4 <a href="#Derived_Extracted">Derived Extracted Properties</a>.</li>
  5942. <li>Added information about the derived status of kCompatibilityVariant in Section 5.7.3
  5943. <a href="#Character_Decomposition_Mappings">Character Decomposition Mapping</a>.</li>
  5944. <li>Added an entry for BidiBrackets.txt and two new bidi properties to <a href="#Property_List_Table">Table 9. Property Table</a>
  5945. and relevant links elsewhere.</li>
  5946. <li>Added BidiCharacterTest.txt to the list of test data files and provided a brief description of its contents in
  5947. Section 6.3 <a href="#BidiTest_txt">Bidirectional Test Files</a>.</li>
  5948. <li>Added new isolate controls to <a href="#BC_Values_Table">Table 13. Bidi_Class Values</a> and reordered
  5949. entries to match the listing in UAX #9.</li>
  5950. <li>Added documentation about the new permalink for the latest UCD release, in Section 4.1
  5951. <a href="#Directory_Structure">Directory Structure</a>.</li>
  5952. </ul>
  5953. <p>Revision 11 being a proposed update, only changes between revisions 12 and
  5954. 10 are noted here.</p>
  5955. <h3>Revision 10 [KW]</h3>
  5956. <ul>
  5957. <li><b>Reissued</b> for Unicode 6.2.0.</li>
  5958. <li>Removed old UCD Change History entry for Unicode 6.0.0, and added new one for Unicode 6.2.0.</li>
  5959. <li>Updated status of <a href="#Script_Extensions">Script_Extensions</a> to informative.</li>
  5960. <li>Updated type of <a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a>
  5961. from String to Miscellaneous.</li>
  5962. <li>Marked <a href="#Unicode_1_Name">Unicode_1_Name</a> as Obsolete and updated its documentation.</li>
  5963. <li>Added text indicating that the UTC must approve any change to normative or informative
  5964. property values, in Section 2.3.1 <a href="#Allowed_Changes">Changes to Properties Between Releases</a>.</li>
  5965. <li>Corrected numbering error for Section 2.3.4 <a href="#Stabilized_Properties">Stabilized Properties</a>.</li>
  5966. <li>Updated the note about NamesList.txt being encoded in Latin-1, because starting with Version 6.2.0, it
  5967. is encoded in UTF-8. See Section 4.2.11 <a href="#Text_Encoding">Text Encoding</a>.</li>
  5968. <li>Added indication that ccc=133 is reserved in Section 5.11.2
  5969. <a href="#Validation_of_CCC">Combining_Character_Class Property</a>.</li>
  5970. <li>Added Section 3.6 <a href="#USource">U-Source Ideographs and UAX #45</a>.</li>
  5971. <li>Added entries to <a href="#UCD_Files_Table">Table 5</a> for USourceData.txt and USourceGlyphs.pdf.</li>
  5972. <li>Removed entry for ScriptExtensions.txt from <a href="#UCD_Files_Table">Table 5</a>.</li>
  5973. </ul>
  5974. <p>Revision 9 being a proposed update, only changes between revisions 10 and
  5975. 8 are noted here.</p>
  5976. <h3>Revision 8 [KW]</h3>
  5977. <ul>
  5978. <li><b>Reissued</b> for Unicode 6.1.0.</li>
  5979. <li>Removed old UCD Change History entry for Unicode 5.2.0, and added new one for Unicode 6.1.0.</li>
  5980. <li>Added details of data file changes for Unicode 6.1.0.</li>
  5981. <li>Updated derivation of <a href="#Default_Ignorable_Code_Point">Default_Ignorable_Code_Point</a> to account for U+0604.</li>
  5982. <li>Added a clarification about empty field values in data files for string properties
  5983. in a new Section 4.2.10 <a href="#Empty_Fields">Empty Fields</a>.</li>
  5984. <li>Added a warning about matching alternative, non-standard names in Section 5.9
  5985. <a href="#Matching_Rules">Matching Rules</a>.</li>
  5986. <li>Added new Section 4.2.8 <a href="#Multiple_Values">Multiple Values for Properties</a>.</li>
  5987. <li>Added new Section 5.7.6 <a href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a>.</li>
  5988. <li>Added documentation of symbolic labels for fixed position canonical combining classes
  5989. in <a href="#CCC_Values_Table">Table 15</a>.</li>
  5990. <li>Updated wording regarding addition of new property values in Section 5.10 <a href="#Invariants">Invariants</a>.</li>
  5991. <li>Corrected URL for the Resolved PRI page reference.</li>
  5992. <li>Added a paragraph about aliases of the form "Ccc10" for fixed position classes
  5993. in <a href="#Canonical_Combining_Class_Values">Canonical Combining Class Values</a>.</li>
  5994. <li>Clarified the current status of the "n/a" metavalue for PropertyValueAliases.txt, in
  5995. <a href="#Property_Aliases">Property and Property Value Aliases</a>.</li>
  5996. <li>Updated regex in <a href="#Common_Subexpressions_Table">Table 20</a> and <a href="#Regular_Expressions_Table">Table 21</a>.</li>
  5997. <li>Updated the description of the <a href="#Name_Alias">Name_Alias</a> property, to account for new types of formal name
  5998. aliases now included in NameAliases.txt.</li>
  5999. <li>Added new Section 5.11.5 <a href="#Validation_of_Multivalued">Validation of Multivalued Properties</a>.</li>
  6000. <li>Added new entry for <a href="#Script_Extensions">Script_Extensions</a> in the Property Table.</li>
  6001. <li>Updated <a href="#Invariants_in_Implementations">Invariants in Implementations</a> and related
  6002. sections to reflect the change in range for Canonical_Combining_Class from 0..255 to 0..254.</li>
  6003. <li>Added note to <a href="#Validation_of_CCC">Combining_Character_Class Property</a> regarding
  6004. implementation use of reserved value 255.</li>
  6005. <li>Added a gray background to entries for contributory properties in the
  6006. <a href="#Property_Index">Property Index</a>.</li>
  6007. <li>Added documentation regarding abbreviations and long aliases for General_Category groupings
  6008. in <a href="#GC_Values_Table">Table 12. General_Category Values</a>.</li>
  6009. <li>Corrected several numerical references to definitions related to casing properties in
  6010. <a href="#Property_List_Table">Table 9. Property Table</a>.</li>
  6011. <li>Added information regarding longest canonical and compatibility mappings in
  6012. <a href="#Character_Decomposition_Mappings">5.7.3 Character Decomposition Mapping</a>.</li>
  6013. <li>Updated status of Grapheme_Base and Grapheme_Extend to normative and corrected their
  6014. descriptions in <a href="#Property_List_Table">Table 9. Property Table</a>.</li>
  6015. <li>Added clarification regarding edge case treatment for Other_Punctuation,
  6016. Other_Symbol, etc. in <a href="#General_Category_Values">5.7.1 General Category Values</a>.</li>
  6017. <li>Added a description and example of the form of derived property definitions in
  6018. <a href="#Simple_Derived">2.1 Simple and Derived Properties</a>.</li>
  6019. <li>Various small editorial fixes.</li>
  6020. </ul>
  6021. <p>Revision 7 being a proposed update, only changes between revisions 8 and
  6022. 6 are noted here.</p>
  6023. <h3>Revision 6 [KW]</h3>
  6024. <ul>
  6025. <li><b>Reissued</b> for Unicode 6.0.0.</li>
  6026. <li>Removed old UCD Change History entries prior to Unicode 5.2.0.</li>
  6027. <li>Updated status of <a href="#Hyphen">Hyphen</a> and <a href="#ISO_Comment">ISO_Comment</a> properties to Deprecated.</li>
  6028. <li>Updated status of several derived normalization properties to Deprecated.</li>
  6029. <li>Added tables listing <a href="#Deprecated_Property_Table">Deprecated</a> and <a href="#Stabilized_Property_Table">Stabilized</a> properties.</li>
  6030. <li>Extended the discussion of the significance of the <a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a> property.</li>
  6031. <li>Clarified the intended application of the <a href="#Ideographic">Ideographic</a>
  6032. and <a href="#Unified_Ideograph">Unified_Ideograph</a> properties.</li>
  6033. <li>Moved Property Summary to top of Section 5, renamed it to Property Index,
  6034. and adjusted Section 5 numbering.</li>
  6035. <li>Renumbered tables to account for two table insertions.</li>
  6036. <li>Rewrote the description of the <a href="#Logical_Order_Exception">Logical_Order_Exception</a>
  6037. and <a href="#White_Space">White_Space</a> properties for clarity.</li>
  6038. <li>Added clarification for <a href="#UAX44-LM2">UAX44-LM2</a> in <a href="#Matching_Rules">Matching Rules</a>.</li>
  6039. <li>Updated matching rule <a href="#UAX44-LM3">UAX44-LM3</a> to ignore initial "is" in <a href="#Matching_Rules">Matching Rules</a>.</li>
  6040. <li>Added U+110BD to the list of exceptions to the derivation of <a href="#Default_Ignorable_Code_Point">Default_Ignorable_Code_Point</a>.</li>
  6041. <li>Added anchors to the matching rules.</li>
  6042. <li>Updated the description fields for <a href="#FC_NFKC_Closure">FC_NFKC_Closure</a>
  6043. and <a href="#NFKC_Casefold">NFKC_Casefold</a>.</li>
  6044. <li>Added entries for EmojiSources.txt and ScriptExtensions.txt to <a href="#UCD_Files_Table">Table 5</a>.</li>
  6045. <li>Added entries for <a href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a> and
  6046. <a href="#Indic_Matra_Category">Indic_Matra_Category</a>.</li>
  6047. <li>Added note clarifying that aliases are not provided for provisional properties in <a href="#Property_Aliases">Section 5.8</a>.</li>
  6048. <li>Added clarification on value ranges and other restrictions for decimal digits in
  6049. discussion of <a href="#Numeric_Type">Numeric_Type</a>.</li>
  6050. <li>Miscellaneous minor point edits.</li>
  6051. </ul>
  6052. <p>Revision 5 being a proposed update, only changes between revisions 6 and
  6053. 4 are noted here.</p>
  6054. <h3>Revision 4 [KW]</h3>
  6055. <ul>
  6056. <li><b>Reissued</b> for Unicode 5.2.0.</li>
  6057. <li>Completely reorganized and rewritten, to include all the content
  6058. from the obsoleted <a href="http://www.unicode.org/Public/5.1.0/ucd/UCD.html">UCD.html</a>.</li>
  6059. <li>Added Section 5.10 on deprecation.</li>
  6060. <li>Added a subsection in Section 4.2 on line termination conventions.</li>
  6061. <li>Added Contributory as a formal status and updated the Property Table accordingly.</li>
  6062. <li>Added note in Section 5.3.1 to indicate that
  6063. contributory properties are neither normative nor informative.</li>
  6064. <li>Updated documentation for default values.</li>
  6065. <li>Cleaned up description of numeric properties.</li>
  6066. <li>Tweaked the description of NamesList.html.</li>
  6067. <li>Miscellaneous minor point edits.</li>
  6068. <li>Updated summary statement of the document.</li>
  6069. <li>Centered tables.</li>
  6070. <li>Added anchors and numbers to tables and adjusted text referencing tables accordingly.</li>
  6071. <li>Added clarifications about exceptional format issues for Unihan data files.</li>
  6072. <li>Updated references to <i>Section 4.8, Name&#x2014;Normative</i> for
  6073. derived names and for code point labels.</li>
  6074. <li>Added mention of property aliases from Unihan data files to Section 5.6.1.</li>
  6075. <li>Added documentation for new derived properties: Cased, Case_Ignorable,
  6076. Changes_When_Lowercased,
  6077. Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casefolded, Changes_When_Casemapped,
  6078. NFKC_Casefold, and Changes_When_NFKC_Casefolded.</li>
  6079. <li>Added strong pointers to Section 3.5 and Chapter 4 of [Unicode] in the Introduction.</li>
  6080. <li>Added new <i>Section 2.3.1, Changes to Properties Between Releases</i>.</li>
  6081. <li>Updated default values for East_Asian_Width.</li>
  6082. <li>Clarified the applicability of comments in cases where properties have multiple
  6083. default values.</li>
  6084. <li>Restructured Section 5.1 documentation of columns in the property table, for better
  6085. text flow.</li>
  6086. <li>Reordered entries for DerivedCoreProperties.txt in the property table, for clarity.</li>
  6087. <li>Added documentation of new test file: BidiTest.txt.</li>
  6088. <li>Updated terminology related to the Unihan Database.</li>
  6089. <li>Added documentation for the new data file, CJKRadicals.txt.</li>
  6090. <li>Added Attached_Above for ccc=214 in Table 13.</li>
  6091. <li>Complete revision of Validation section and associated tables.</li>
  6092. <li>Minor revision of text in <i>Section 4.1.5, File Directory Differences for Early Releases</i>.</li>
  6093. <li>Added a cautionary note about the use of the Age property in regular expressions.</li>
  6094. <li>Added sections explaining obsolete, deprecated, and stabilized properties, and
  6095. clearly identified existing such properties in the property table.</li>
  6096. </ul>
  6097. <p>Revision 3 being a proposed update, only changes between revisions 4 and
  6098. 2 are noted here.</p>
  6099. <h3>Revision 2</h3>
  6100. <ul>
  6101. <li>Initial approved version for Unicode 5.1.0.</li>
  6102. </ul>
  6103. <h3>Revision 1</h3>
  6104. <ul>
  6105. <li>Initial draft.</li>
  6106. </ul>
  6107. <hr>
  6108. <p class="copyright">© 2017 Unicode, Inc. All Rights Reserved. The Unicode
  6109. Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors
  6110. or omissions. No liability is assumed for incidental and consequential damages in connection with
  6111. or arising out of the use of the information or programs contained or accompanying this technical
  6112. report. The Unicode <a href="http://www.unicode.org/copyright.html">Terms of Use</a> apply.</p>
  6113. <p class="copyright">Unicode and the Unicode logo are trademarks of Unicode, Inc., and are
  6114. registered in some jurisdictions.</p>
  6115. </div> <!-- body -->
  6116. </body>
  6117. </html>